---
title: "Exercise 8.17"
author: "Per August Jarval Moen"
date: "2024"
output: pdf_document
---

\subsection*{a)}

Let $Y_i$ be the proportion of successful shots by Ray Allen during game $i$, and let $n_i$ be the total number of shot attempts during game $i$. We assume the data originate from an intercept-only logistic regression model, i.e. that
$$Y_i \overset{\text{ind}}{\sim}\frac{1}{n_i}\text{Bin}(n_i, \pi_i),$$
where $\text{logit}(\pi_i) = \beta_0$ for all $i=1,\dots,24$ for some unknown $\beta_0 \in \mathbb{R}$. We load the data and fit the model:

```{r}
url = "http://users.stat.ufl.edu/~aa/glm/data/Basketball.dat"
data = read.table(url, header = TRUE)
data$y = data$made/data$attempts
logistic.model = glm(y ~ 1, family = binomial(link = "logit"),
                     weights = attempts, data = data)
```

\textbf{Remark.} As noted before, when fitting a logistic regression model in the above fashion (using success proportions), we have to specify \textit{weights = attempts} so that R knows the number of attempts in each Binomial trial.
\newline
\newline

```{r}
summ = summary(logistic.model)
beta.0.hat = summ$coefficients[1,1]
est.success.prob = exp(beta.0.hat)/(1+exp(beta.0.hat))
ci = exp(beta.0.hat + 1.96*summ$coefficients[1,2]*c(-1,1)) /
  (1+exp(beta.0.hat + 1.96*summ$coefficients[1,2]*c(-1,1)))
```

The estimated success probability is $\text{expit}(\hat{\beta}_0)$\footnote{The function $\text{expit}$ is given by $\text{expit}(x) = \frac{e^x}{1+e^x}$.} = `r est.success.prob`, and an approximate 95\% confidence interval for this probability is $\text{expit}(\hat{\beta}_0 \pm 1.96 \cdot \hat{\text{se}}(\hat{\beta}_0)) =$ (`r ci[1]`, `r ci[2]`).

\subsection*{b)}

It is reasonable to assume that the skill level of the opposing basketball team affects the probability of a successful shot. The skill level of the opposing team is not included in the linear predictor.
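As a side check (a sketch using hypothetical toy counts, not the Basketball data), the same intercept-only model can equivalently be fit with a two-column response of successes and failures, in which case the \textit{weights} argument is not needed:

```{r}
# Hypothetical toy counts for illustration only (not the Basketball data)
made <- c(8, 10, 7)
attempts <- c(15, 20, 12)

# Two-column response cbind(successes, failures); no weights argument needed
fit.counts <- glm(cbind(made, attempts - made) ~ 1, family = binomial)

# Proportion response with weights, as in the code above
fit.props <- glm(made/attempts ~ 1, family = binomial, weights = attempts)

all.equal(coef(fit.counts), coef(fit.props))  # the two fits agree
```

Both specifications give the same likelihood up to a constant, so the coefficient estimates and standard errors coincide.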
As the variance of the response in a logistic regression depends on the probability of success, omitting an important covariate from the linear predictor should cause overdispersion. We can try to remedy this by using a quasi-likelihood (QL) approach. We drop the distributional assumption and assume instead that
\begin{align*}
\mathbb{E}(Y_i) &= \pi_i,
\intertext{where $\text{logit}(\pi_i) = \beta_0$, and,}
\text{Var}(Y_i) &= \phi \frac{\pi_i (1-\pi_i)}{n_i},
\end{align*}
where $\phi>0$ models the overdispersion. Notice that the expectation is the same as in the logistic regression model, and that the variance is the same except that it is multiplied by $\phi$. We now fit the quasi-logistic model. "Under the hood", the QL estimate of $\beta_0$ is computed by R using the quasi-score equations, and the dispersion parameter $\phi$ is estimated using the generalized Pearson statistic. Notice that we specify \textit{variance = "mu(1-mu)"} and \textit{weights = attempts} in the code.

```{r}
QL.model = glm(y ~ 1, family = quasi(link = "logit", variance = "mu(1-mu)"),
               weights = attempts, data = data)
summary(QL.model)
```

```{r echo=FALSE}
summ2 = summary(QL.model)
beta.0.hat2 = summ2$coefficients[1,1]
est.success.prob2 = exp(beta.0.hat2)/(1+exp(beta.0.hat2))
ci2 = exp(beta.0.hat2 + 1.96*summ2$coefficients[1,2]*c(-1,1)) /
  (1+exp(beta.0.hat2 + 1.96*summ2$coefficients[1,2]*c(-1,1)))
```

The point estimate $\hat{\beta}_0$ is the same as before, and thus the estimated success probability is also the same as before. An approximate 95\% CI for the success probability using the quasi-logistic model is $\text{expit}(\hat{\beta}_0 \pm \sqrt{\hat{\phi}} \cdot 1.96 \cdot \hat{\text{se}}(\hat{\beta}_0)) =$ (`r ci2[1]`, `r ci2[2]`). Note that the standard error reported by \texttt{summary} for the quasi model already equals $\sqrt{\hat{\phi}}$ times the standard error from the binomial fit, so no extra scaling is needed in the code. This is a wider confidence interval than we obtained using the logistic regression model (which does not account for overdispersion).
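To see concretely how $\hat{\phi}$ is obtained from the generalized Pearson statistic, the following sketch (on simulated toy proportions, not the Basketball data) reproduces by hand the dispersion estimate reported by \texttt{summary}:

```{r}
# Simulated toy data for illustration only (not the Basketball data)
set.seed(1)
n <- rep(20, 24)                          # hypothetical attempts per game
y <- rbinom(24, size = n, prob = 0.5)/n   # simulated success proportions

fit <- glm(y ~ 1, family = quasi(link = "logit", variance = "mu(1-mu)"),
           weights = n)

# Generalized Pearson statistic divided by residual degrees of freedom
phi.hat <- sum(residuals(fit, type = "pearson")^2)/fit$df.residual

all.equal(phi.hat, summary(fit)$dispersion)  # matches summary's estimate
```

Here $\hat{\phi} = \frac{1}{n-p}\sum_i \frac{(y_i - \hat{\pi}_i)^2}{\hat{V}(\hat{\pi}_i)/w_i}$, i.e. the sum of squared Pearson residuals divided by the residual degrees of freedom, which is exactly what \texttt{summary.glm} computes for a quasi family.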