---
title: "Exercise 5.13"
author: "Jonas Moss & Per August Moen"
date: "Fall 2024"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

\section*{Exercise 5.13}

In this exercise we will look at the residuals of a logistic regression model. The main motivation for considering residuals is to assess how well the model fits the data. Keep this in mind when doing the exercise.
\newline
\newline
We consider the logistic regression model
$$ Y_i \overset{\text{ind}}{\sim} \frac{1}{n_i}\text{Bin}(n_i, \pi_i),$$
for $i \in [n]$, where $g(\pi_i) = \beta_0 + \beta_1 x_i$ and $g(\cdot) = \text{logit}(\cdot)$ (the canonical link). We assume that $\beta_0 = -2$ and $\beta_1 = 0.04$.
\newline
\newline
We also assume the $x_i$'s are drawn independently and uniformly from the interval $[0, 100]$. For simplicity we take $n_i = 1$ for all $i$.
\newline
\newline
Let's generate data from the model and plot the raw residuals $\hat{\epsilon}_i := Y_i - \hat{\mu}_i$ against the $x_i$'s and against the fitted values. This will show us how the residuals ought to look (i.e. when the model fit is good).

```{r cache=TRUE}
set.seed(420) # so that we get the same "random" numbers each time
n = 100
min = 0
max = 100
beta0 = -2
beta1 = 0.04

# generating data
xs = runif(n = n, min = min, max = max)
kappas = beta0 + beta1 * xs
pi_s = exp(kappas) / (1 + exp(kappas)) # pi_i = logit^{-1}(kappa_i)
ys = rbinom(n = n, size = rep(1, n), prob = pi_s) # size argument is the vector (n_1, ..., n_n)

# fitting the GLM:
fit.glm = glm(ys ~ xs, family = binomial(link = "logit"))

# computing the raw residuals:
fitted_vals = fitted(fit.glm)
residuals = ys - fitted_vals
```

We first plot the residuals against the covariates $x_i$. To make the plot easier to understand, we also plot $\frac{1}{2}$ minus the fitted values (in red):

```{r}
plot(xs, residuals, main = "")
points(xs, 0.5 - fitted_vals, col = 2)
```

What do we see from the plot?
Well, given that $\hat{\pi}_i$ estimates $\pi_i$ reasonably well, each residual $\hat{\epsilon}_i = Y_i - \hat{\pi}_i$ is approximately distributed as a mean-centered Bernoulli random variable: it equals $1 - \hat{\pi}_i$ when $Y_i = 1$ and $-\hat{\pi}_i$ when $Y_i = 0$. Hence $\hat{\epsilon}_i$ differs from $\frac{1}{2} - \hat{\pi}_i$ by exactly $\pm\frac{1}{2}$. Formally, we will have that
$$ Y_i - \hat{\mu}_i \overset{d}{\approx} \frac{1}{2} - \pi_i + U_i, $$
where $U_i = \frac{1}{2}$ with probability $\pi_i$ and $U_i = -\frac{1}{2}$ with probability $1 - \pi_i$.
\newline
\newline
Let's now plot the raw residuals against the fitted values. As above, we also plot $\frac{1}{2}$ minus the fitted values (in red):

```{r}
plot(fitted_vals, residuals, main = "")
points(fitted_vals, 1/2 - fitted_vals, col = 2)
```

\leavevmode
\newline
Since we are plotting against the fitted values, the residuals now fall on two parallel lines: as functions of $\hat{\pi}_i$, the two possible residual values $1 - \hat{\pi}_i$ and $-\hat{\pi}_i$ are both lines with slope $-1$.
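The two-parallel-lines claim can be checked numerically. The following is a small self-contained sketch (re-using the seed and parameters from above, and using `plogis`, R's built-in inverse logit, in place of the manual `exp`/`(1 + exp)` computation): since each residual is either $1 - \hat{\pi}_i$ or $-\hat{\pi}_i$, the sum of a residual and its fitted value should be exactly $1$ or $0$.

```{r}
# Sketch: verify that every raw residual plus its fitted value is 0 or 1,
# which places the residuals on the two lines -pi_hat and 1 - pi_hat.
set.seed(420)                          # same seed as in the data-generating chunk
n <- 100
xs <- runif(n, min = 0, max = 100)
pi_s <- plogis(-2 + 0.04 * xs)         # inverse logit of beta0 + beta1 * x_i
ys <- rbinom(n, size = 1, prob = pi_s)
fit <- glm(ys ~ xs, family = binomial(link = "logit"))
res <- ys - fitted(fit)
# Rounding guards against floating-point noise; the result should be {0, 1}:
sort(unique(round(res + fitted(fit), 10)))
```

This also explains why the first plot (against $x_i$) shows two curved bands rather than straight lines: there the two branches are $1 - \hat{\pi}(x)$ and $-\hat{\pi}(x)$, which are shifted copies of the (nonlinear) inverse-logit curve.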