---
title: "Exercise 8.17"
author: "Per August Jarval Moen"
date: "2024"
output: pdf_document
---

\subsection*{a)}

Let $Y_i$ be the proportion of successful shots by Ray Allen during game $i$, and let $n_i$ be the total number of shot attempts during game $i$. We assume the data originate from an intercept-only logistic regression model, i.e. that
$$Y_i \overset{\text{ind}}{\sim}\frac{1}{n_i}\text{Bin}(n_i, \pi_i),$$
where $\text{logit}(\pi_i) = \beta_0$ for all $i=1,\dots,24$ for some unknown $\beta_0 \in \mathbb{R}$. We load the data and fit the model:

```{r}
url = "http://users.stat.ufl.edu/~aa/glm/data/Basketball.dat"
data = read.table(url, header = TRUE)
data$y = data$made/data$attempts
logistic.model = glm(y ~ 1, family = binomial(link = "logit"),
                     weights = attempts, data = data)
```

\textbf{Remark.} As noted before, when fitting a logistic regression model in the above fashion (using success proportions), we have to specify \textit{weights = attempts} so that R knows the number of attempts in each Binomial trial.
\newline
\newline

```{r}
summ = summary(logistic.model)
beta.0.hat = summ$coefficients[1,1]
est.success.prob = exp(beta.0.hat)/(1+exp(beta.0.hat))
ci = exp(beta.0.hat + 1.96*summ$coefficients[1,2]*c(-1,1)) /
  (1+exp(beta.0.hat + 1.96*summ$coefficients[1,2]*c(-1,1)))
```

The estimated success probability is $\text{expit}(\hat{\beta}_0)$\footnote{The function $\text{expit}$ is given by $\text{expit}(x) = \frac{e^x}{1+e^x}$.} = `r est.success.prob`, and an approximate 95\% confidence interval for this probability is $\text{expit}(\hat{\beta}_0 \pm 1.96 \cdot \hat{\text{se}}(\hat{\beta}_0)) =$ (`r ci[1]`, `r ci[2]`).

\subsection*{b)}

It is reasonable to assume that the skill level of the opposing basketball team affects the probability of a successful shot. The skill level of the opposing team is not included in the linear predictor.
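As a side check (a sketch using hypothetical toy counts, not the Basketball data), the same intercept-only model can equivalently be fit with a two-column response of successes and failures, in which case the \textit{weights} argument is not needed:

```{r}
# Hypothetical toy counts for illustration only (not the Basketball data)
made <- c(8, 10, 7)
attempts <- c(15, 20, 12)

# Two-column response cbind(successes, failures); no weights argument needed
fit.counts <- glm(cbind(made, attempts - made) ~ 1, family = binomial)

# Proportion response with weights, as in the code above
fit.props <- glm(made/attempts ~ 1, family = binomial, weights = attempts)

all.equal(coef(fit.counts), coef(fit.props))  # the two fits agree
```

Both specifications give the same likelihood up to a constant, so the coefficient estimates and standard errors coincide.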
As the variance of the response in a logistic regression depends on the probability of success, omitting an important covariate from the linear predictor should cause overdispersion. We can try to remedy this by using a quasi-likelihood (QL) approach. We drop the distributional assumption and assume instead that
\begin{align*}
\mathbb{E}(Y_i) &= \pi_i,
\intertext{where $\text{logit}(\pi_i) = \beta_0$, and,}
\text{Var}(Y_i) &= \phi \frac{\pi_i (1-\pi_i)}{n_i},
\end{align*}
where $\phi>0$ models the overdispersion. Notice that the expectation is the same as in the logistic regression model, and that the variance is the same except that it is multiplied by $\phi$. We now fit the quasi-logistic model. "Under the hood", the QL estimate of $\beta_0$ is computed by R using the quasi-score equations, and the dispersion parameter $\phi$ is estimated using the generalized Pearson statistic. Notice that we specify \textit{variance = "mu(1-mu)"} and \textit{weights = attempts} in the code.

```{r}
QL.model = glm(y ~ 1, family = quasi(link = "logit", variance = "mu(1-mu)"),
               weights = attempts, data = data)
summary(QL.model)
```

```{r echo=FALSE}
summ2 = summary(QL.model)
beta.0.hat2 = summ2$coefficients[1,1]
est.success.prob2 = exp(beta.0.hat2)/(1+exp(beta.0.hat2))
ci2 = exp(beta.0.hat2 + 1.96*summ2$coefficients[1,2]*c(-1,1)) /
  (1+exp(beta.0.hat2 + 1.96*summ2$coefficients[1,2]*c(-1,1)))
```

The point estimate $\hat{\beta}_0$ is the same as before, and thus the estimated success probability is also the same as before. An approximate 95\% CI for the success probability using the quasi-logistic model is $\text{expit}(\hat{\beta}_0 \pm \sqrt{\hat{\phi}} \cdot 1.96 \cdot \hat{\text{se}}(\hat{\beta}_0)) =$ (`r ci2[1]`, `r ci2[2]`). Note that the standard error reported by \texttt{summary} for the quasi model already equals $\sqrt{\hat{\phi}}$ times the standard error from the binomial fit, so no extra scaling is needed in the code. This is a wider confidence interval than we obtained using the logistic regression model (which does not account for overdispersion).
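To see concretely how $\hat{\phi}$ is obtained from the generalized Pearson statistic, the following sketch (on simulated toy proportions, not the Basketball data) reproduces by hand the dispersion estimate reported by \texttt{summary}:

```{r}
# Simulated toy data for illustration only (not the Basketball data)
set.seed(1)
n <- rep(20, 24)                          # hypothetical attempts per game
y <- rbinom(24, size = n, prob = 0.5)/n   # simulated success proportions

fit <- glm(y ~ 1, family = quasi(link = "logit", variance = "mu(1-mu)"),
           weights = n)

# Generalized Pearson statistic divided by residual degrees of freedom
phi.hat <- sum(residuals(fit, type = "pearson")^2)/fit$df.residual

all.equal(phi.hat, summary(fit)$dispersion)  # matches summary's estimate
```

Here $\hat{\phi} = \frac{1}{n-p}\sum_i \frac{(y_i - \hat{\pi}_i)^2}{\hat{V}(\hat{\pi}_i)/w_i}$, i.e. the sum of squared Pearson residuals divided by the residual degrees of freedom, which is exactly what \texttt{summary.glm} computes for a quasi family.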