---
title: "Exercise 5.13"
author: "Jonas Moss & Per August Moen"
date: "Fall 2024"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

\section*{Exercise 5.13}

In this exercise we will look at the residuals of a logistic regression model. The main motivation for considering residuals is to assess how well the model fits the data. Keep this in mind when doing the exercise.
\newline
\newline
We consider the logistic regression model
$$ Y_i \overset{\text{ind}}{\sim} \frac{1}{n_i}\text{Bin}(n_i, \pi_i),$$
for $i \in [n]$, where $g(\pi_i) = \beta_0 + \beta_1 x_i$ and $g(\cdot) = \text{logit}(\cdot)$ (the canonical link). We assume that $\beta_0 = -2$ and $\beta_1 = 0.04$.
\newline
\newline
We also assume the $x_i$'s are drawn independently and uniformly from the interval $[0, 100]$. For simplicity we take $n_i = 1$ for all $i$.
\newline
\newline
Let's generate data from the model and plot the raw residuals $\hat{\epsilon}_i := Y_i - \hat{\mu}_i$ against the $x_i$'s and against the fitted values. This will show us how the residuals ought to look (i.e. when the model fit is good).

```{r cache=TRUE}
set.seed(420) # so that we get the same "random" numbers each time
n = 100
min = 0
max = 100
beta0 = -2
beta1 = 0.04

# generating data
xs = runif(n = n, min = min, max = max)
kappas = beta0 + beta1 * xs
pi_s = exp(kappas) / (1 + exp(kappas)) # pi_i = logit^{-1}(kappa_i)
ys = rbinom(n = n, size = rep(1, n), prob = pi_s) # size argument is the vector (n_1, ..., n_n)

# fitting the GLM:
fit.glm = glm(ys ~ xs, family = binomial(link = "logit"))

# computing the raw residuals:
fitted_vals = fitted(fit.glm)
residuals = ys - fitted_vals
```

We first plot the residuals against the covariates $x_i$. To make the plot easier to understand, we also plot $\frac{1}{2}$ minus the fitted values (in red):

```{r}
plot(xs, residuals, main = "")
points(xs, 0.5 - fitted_vals, col = 2)
```

What do we see from the plot?
Well, given that $\hat{\pi}_i$ estimates $\pi_i$ reasonably well, each residual $\hat{\epsilon}_i = Y_i - \hat{\pi}_i$ is approximately distributed as a mean-centered Bernoulli random variable: it equals $1 - \hat{\pi}_i$ when $Y_i = 1$ and $-\hat{\pi}_i$ when $Y_i = 0$. Hence $\hat{\epsilon}_i$ differs from $\frac{1}{2} - \hat{\pi}_i$ by exactly $\pm\frac{1}{2}$. Formally, we will have that
$$ Y_i - \hat{\mu}_i \overset{d}{\approx} \frac{1}{2} - \pi_i + U_i, $$
where $U_i = \frac{1}{2}$ with probability $\pi_i$ and $U_i = -\frac{1}{2}$ with probability $1 - \pi_i$.
\newline
\newline
Let's now plot the raw residuals against the fitted values. As above, we also plot $\frac{1}{2}$ minus the fitted values (in red):

```{r}
plot(fitted_vals, residuals, main = "")
points(fitted_vals, 1/2 - fitted_vals, col = 2)
```

\leavevmode
\newline
Since we are plotting against the fitted values, the residuals now fall on two parallel lines: as functions of $\hat{\pi}_i$, the two possible residual values $1 - \hat{\pi}_i$ and $-\hat{\pi}_i$ are both lines with slope $-1$.
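The two-parallel-lines claim can be checked numerically. The following is a small self-contained sketch (re-using the seed and parameters from above, and using `plogis`, R's built-in inverse logit, in place of the manual `exp`/`(1 + exp)` computation): since each residual is either $1 - \hat{\pi}_i$ or $-\hat{\pi}_i$, the sum of a residual and its fitted value should be exactly $1$ or $0$.

```{r}
# Sketch: verify that every raw residual plus its fitted value is 0 or 1,
# which places the residuals on the two lines -pi_hat and 1 - pi_hat.
set.seed(420)                          # same seed as in the data-generating chunk
n <- 100
xs <- runif(n, min = 0, max = 100)
pi_s <- plogis(-2 + 0.04 * xs)         # inverse logit of beta0 + beta1 * x_i
ys <- rbinom(n, size = 1, prob = pi_s)
fit <- glm(ys ~ xs, family = binomial(link = "logit"))
res <- ys - fitted(fit)
# Rounding guards against floating-point noise; the result should be {0, 1}:
sort(unique(round(res + fitted(fit), 10)))
```

This also explains why the first plot (against $x_i$) shows two curved bands rather than straight lines: there the two branches are $1 - \hat{\pi}(x)$ and $-\hat{\pi}(x)$, which are shifted copies of the (nonlinear) inverse-logit curve.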