---
title: "Additional Exercise 19"
author: "Per August Jarval Moen"
date: "5/10/2022"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

\section*{Additional exercise 19}
We begin by loading in the data:
```{r cache=TRUE}
data="http://www.uio.no/studier/emner/matnat/math/STK3100/data/car.txt"
car = read.table(data,header=TRUE,sep=",")
car0 = car[car$claimcst0>0,]
car0$agecat = as.factor(car0$agecat)
car0$gender = as.factor(car0$gender)
car0$area = as.factor(car0$area)
```
<!-- \leavevmode \newline \newline -->
Notice that the response variable is \textbf{claimcst0} (the claim amounts).
\subsection*{a)}
We fit a GLM
$$Y_i \overset{\text{ind}}{\sim} \text{InverseGaussian}(\mu_i, \sigma^2),$$
where $g(\mu_i) = x_i^{\text{T}} \beta$ for $i \in [n]$, with unknown coefficient vector $\beta$ and log link $g(\cdot) = \log(\cdot)$. \newline
```{r cache=TRUE}
fit = glm(claimcst0~agecat+gender+area,data=car0,
family=inverse.gaussian(link="log"))
summary(fit)
```
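With the log link, the fitted mean is $\mu_i = \exp(x_i^{\text{T}} \beta)$, so each coefficient acts multiplicatively on the expected claim amount. As a worked example, assume R's default treatment coding, so that \textbf{genderM} is an indicator for the level M of gender with the remaining level as reference. Two drivers $i$ (level M) and $i'$ (reference level) who otherwise share the same covariates then satisfy
$$
\frac{\mu_{i}}{\mu_{i'}} = \frac{\exp(x_{i}^{\text{T}} \beta)}{\exp(x_{i'}^{\text{T}} \beta)} = \exp(\beta_{\text{genderM}}),
$$
so $\exp(\widehat{\beta}_{\text{genderM}})$ estimates the ratio of expected claim amounts between the two gender levels within the same age category and area.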
\subsection*{b)}
Let's perform a Wald test of whether the coefficient for gender is statistically significant. Under the null hypothesis that $\beta_{\text{genderM}}=0$, we have approximately
$$
\frac{\widehat{\beta}_{\text{genderM}}}{\text{se}(\widehat{\beta}_{\text{genderM}})} \overset{d}{\approx} \text{N}(0,1).
$$
Both $\widehat{\beta}_{\text{genderM}}$ and $\text{se}(\widehat{\beta}_{\text{genderM}})$ can be read from the R output. We can also retrieve them ourselves:
```{r cache=TRUE}
se = summary(fit)$coefficients["genderM", "Std. Error"]
beta.hat.gender = summary(fit)$coefficients["genderM", "Estimate"]
```
<!-- \leavevmode \newline -->
The Wald statistic and an approximate p-value for the test of $H_0 : {\beta}_{\text{genderM}} =0$ versus $H_1 : {\beta}_{\text{genderM}} \neq 0$ can be computed as follows:
```{r cache=TRUE}
wald_statistic = beta.hat.gender/se
wald_statistic
# abs() keeps the two-sided p-value valid if the statistic is negative
pvalue = 2*pnorm(abs(wald_statistic), lower.tail=FALSE)
pvalue
```
The approximate p-value is `r pvalue`, which is quite low. Hence we reject the null hypothesis at any significance level above roughly 0.7\%, for instance at the conventional 1\% level.
\newline
\newline
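The same computation can be wrapped in a small helper. The sketch below is only an illustration: the function name `wald_test` and the numbers passed to it are hypothetical and are not taken from the fitted model. Taking the absolute value of the statistic gives the same two-sided p-value for positive and negative estimates.

```{r cache=TRUE}
# Hypothetical helper: two-sided Wald test for a single coefficient.
wald_test = function(estimate, std_error) {
  z = estimate / std_error
  # abs() makes the p-value two-sided regardless of the sign of z
  p = 2 * pnorm(abs(z), lower.tail = FALSE)
  list(statistic = z, p_value = p)
}

# Made-up inputs, chosen only to illustrate the symmetry in the sign
wald_test(0.5, 0.25)$p_value
wald_test(-0.5, 0.25)$p_value
```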
Let's compare the Wald test to a Likelihood Ratio test. We begin by fitting the model under the null hypothesis, in which gender is dropped:
```{r cache=TRUE}
fit0 = glm(claimcst0~agecat+area,data=car0,
family=inverse.gaussian(link="log"))
```
Recall that the LR test statistic is given by
$$
Z^2_{\text{LR}} := -2 \{  \underset{H_0}{\sup} \ \ell(\beta) -  \underset{H_1}{\sup} \ \ell(\beta)  \},
$$
and is approximately $\chi_1^2$ distributed under $H_0$. In words, the LR statistic is minus two times the difference between the maximized log-likelihood of the GLM without gender and that of the GLM with gender. We retrieve the log-likelihoods from the fitted models and compute the LR statistic with its approximate p-value:
```{r cache=TRUE}
loglik0 = as.numeric(logLik(fit0))
loglik1 = as.numeric(logLik(fit))
LR_stat = -2 *(loglik0 - loglik1)
LR_stat
pval_LR = pchisq(LR_stat, df=1, lower.tail = FALSE)
pval_LR
```
\leavevmode \newline
The p-value from the Likelihood Ratio test is slightly lower than the p-value from the Wald test. However, the difference is negligible, and for all practical purposes, the tests give the same result. The effect of gender on the claim amount is statistically significant. 
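
The two tests are directly comparable because the square of a standard normal variable is $\chi_1^2$ distributed, so the squared Wald statistic is referred to the same distribution as $Z^2_{\text{LR}}$. A quick numerical check of this identity, using an arbitrary value $z$ that is not taken from the model:

```{r cache=TRUE}
z = 2.5  # arbitrary illustrative value, not a model output
p_normal = 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided normal p-value
p_chisq = pchisq(z^2, df = 1, lower.tail = FALSE)  # chi-square(1) p-value of z^2
all.equal(p_normal, p_chisq)
```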

\subsection*{c)}
The issue with using a Wald test to assess the statistical significance of the driver's age on the amount claimed is that the driver's age is a categorical variable with more than two levels. Hence there are several coefficients corresponding to the age of the driver, and a single-coefficient Wald test cannot test them jointly. There exist multivariate Wald tests for testing the significance of several parameters at once, but it is easier to use a Likelihood Ratio test.
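
For completeness, a multivariate Wald test can be sketched as follows. With $\widehat{\beta}_S$ the subvector of estimates under test and $\widehat{V}_S$ the corresponding block of the estimated covariance matrix, the statistic $\widehat{\beta}_S^{\text{T}} \widehat{V}_S^{-1} \widehat{\beta}_S$ is approximately $\chi^2$ distributed under $H_0$, with degrees of freedom equal to the number of tested coefficients. The helper name `wald_joint` is hypothetical, and the example uses R's built-in `warpbreaks` data with a Poisson GLM purely as a stand-in, not the car data.

```{r cache=TRUE}
# Hypothetical helper: joint Wald test of all coefficients whose
# names match a regular expression.
wald_joint = function(model, pattern) {
  idx = grep(pattern, names(coef(model)))
  b = coef(model)[idx]
  V = vcov(model)[idx, idx, drop = FALSE]
  stat = as.numeric(t(b) %*% solve(V, b))  # quadratic form b' V^{-1} b
  list(statistic = stat, df = length(idx),
       p_value = pchisq(stat, df = length(idx), lower.tail = FALSE))
}

# Stand-in example (not the car data): test both tension coefficients
toy = glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)
wald_joint(toy, "^tension")
```

Applied to the model from a), the analogous call would be something like `wald_joint(fit, "^agecat")`.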

Formally, let $H_0 : \beta_{\text{agecat2}}=\dots =  \beta_{\text{agecat6}}=0$ be our null hypothesis, and let our alternative hypothesis be given by $H_1 : \beta_{\text{agecat }j} \neq 0$ for some $j =2,\dots,6$. Then the LR test statistic is given by
$$
Z^2_{\text{LR}} := -2 \{  \underset{H_0}{\sup} \ \ell(\beta) -  \underset{H_1}{\sup} \ \ell(\beta)  \}.
$$
Under $H_0$, the LR statistic $Z_{\text{LR}}^2$ is approximately $\chi_5^2$ distributed. The number of degrees of freedom is $5$ because the model under the alternative has five more free parameters than the null model (the categorical variable for age has $6$ levels, one of which serves as the reference level). Let's compute the test statistic and the corresponding p-value:

```{r cache=TRUE}
fit0 = glm(claimcst0~gender+area,data=car0,
family=inverse.gaussian(link="log"))

loglik0 = as.numeric(logLik(fit0))
loglik1 = as.numeric(logLik(fit))
LR_stat = -2 *(loglik0 - loglik1)
LR_stat
pval_LR = pchisq(LR_stat, df=5, lower.tail = FALSE)
pval_LR
```
\leavevmode \newline
The p-value is small, and we conclude that the effect of age on the claim amount is statistically significant.
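
Since the same steps were used in b) and c), they can be collected in a helper that reads the degrees of freedom off the two fitted models instead of hard-coding them. This is a sketch under assumptions: the name `lr_test` is hypothetical, the two models must be nested and fitted to the same observations, and the demonstration uses the built-in `warpbreaks` data with a Poisson GLM rather than the car data.

```{r cache=TRUE}
# Hypothetical helper: LR test of a null model nested in a full model.
lr_test = function(fit_null, fit_full) {
  ll0 = logLik(fit_null)
  ll1 = logLik(fit_full)
  stat = -2 * (as.numeric(ll0) - as.numeric(ll1))
  # the "df" attribute counts the estimated parameters of each model
  df = attr(ll1, "df") - attr(ll0, "df")
  list(statistic = stat, df = df,
       p_value = pchisq(stat, df = df, lower.tail = FALSE))
}

# Stand-in example (not the car data): drop the 3-level factor tension
toy_full = glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)
toy_null = glm(breaks ~ wool, data = warpbreaks, family = poisson)
lr_test(toy_null, toy_full)
```

For part c), `lr_test(fit0, fit)` should reproduce the statistic above with 5 degrees of freedom.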