\documentclass[10pt,UKenglish]{article}

\RequirePackage{amsthm,amsmath,amsfonts}
\RequirePackage{bm}

\usepackage{enumerate}
%\usepackage{enumitem}
\usepackage{paralist}
\usepackage{graphicx}
\usepackage{hyperref}

\usepackage{color}
\newcommand{\E}{\mbox{E}}
\newcommand{\sd}{\mbox{sd}}
\newcommand{\Var}{\mbox{Var}}

\usepackage{url}
\usepackage{parskip}

\newcommand{\B}{\boldsymbol}
\newcommand{\Bb}{\mathbf}

\begin{document}

\begin{center}
\section*{STK3100/4100 -- Introduction to Generalized Linear Models}
\subsection*{Mandatory assignment 2 of 2}
\end{center}

\subsubsection*{Submission deadline}

Thursday November 7 2024, 14:30 in Canvas (\url{canvas.uio.no}).

\subsubsection*{Instructions}

Note that you have \textbf{one attempt} to pass the assignment. This means that 
there are no second attempts. You can choose between scanning handwritten notes 
or typing the solution directly on a computer (for instance with \LaTeX). The assignment 
must be submitted as \textbf{a single PDF file}. Scanned pages must be clearly legible. 
The submission must contain your name, course and assignment number.

It is expected that you give a clear presentation with all necessary explanations. Remember 
to include all relevant plots and figures. All aids, including collaboration, are allowed, but 
the submission must be written by you and reflect your understanding of the subject. If we 
doubt that you have understood the content you have handed in, we may request that you 
give an oral account.

In exercises where you are asked to write a computer program, you need to hand in the code 
along with the rest of the assignment. It is important that the submitted program contains a 
trial run, so that it is easy to see the result of the code.

\subsubsection*{Application for postponed delivery}

If you need to apply for a postponement of the submission deadline due to illness or other 
reasons, you have to contact the Student Administration at the Department of Mathematics 
(e-mail: \href{mailto:studieinfo@math.uio.no}{studieinfo@math.uio.no}) no later than the 
same day as the deadline. All mandatory assignments in this course must be approved in the 
same semester, before you are allowed to take the final examination.

\subsubsection*{Specifically about this assignment}

In order to get the assignment accepted you need to fulfil the following requirements:
\begin{itemize}
\item Make a real attempt at all (sub-)questions.
\item Give satisfactory answers to at least 60\% of the (sub-)questions.
\item Include relevant R output in your report.
\end{itemize}

\subsubsection*{Complete guidelines about delivery of mandatory assignments:}
\url{www.uio.no/english/studies/admin/compulsary-activities/mn-math-mandatory.html}

\begin{center}
GOOD LUCK!
\end{center}  

\subsubsection*{Problem 1}

In this problem we will look at data on the number of claims during one year in a portfolio of insured cars from an English insurance company. The number of claims is registered by the age of the policyholder (in four age groups), the engine volume of the car (in four volume groups), and the district where the car is insured (four districts).

You may read the data into R by the commands:

\begin{verbatim}
data="http://www.uio.no/studier/emner/matnat/math/STK3100/data/claims.txt"
claims=read.table(data,header=TRUE)
\end{verbatim}

The data file consists of one line for each of the 64 combinations of age group, volume group and district, and with the following variables in the five columns:
\begin{itemize}
\item \texttt{alder}: Age of policyholder (1 = below 25 years; 2 = 25–29 years; 3 = 30–35 years; 4 = over 35 years).
\item \texttt{motorvolum}: Engine volume (1 = below 1 litre; 2 = 1–1.5 litres; 3 = 1.5–2 litres; 4 = over 2 litres).
\item \texttt{distrikt}: District (4 = London and other large cities; 1–3 = other districts).
\item \texttt{antforsikret}: Number of insured cars.
\item \texttt{antskader}: Number of claims.
\end{itemize}

We will assume that the number of claims (\texttt{antskader}) is Poisson distributed within 
each of the 64 combinations of age group, volume group and district.
\begin{enumerate}[a)]
\item Explain why this may be a reasonable assumption.
\end{enumerate}

We will use a GLM for Poisson data with logarithmic link function and age of the policyholder 
(\texttt{alder}), engine volume (\texttt{motorvolum}) and district (\texttt{distrikt}) as 
categorical covariates (factors).
\begin{enumerate}[a)]\addtocounter{enumi}{1}
\item Explain why you should use the logarithm of the number of insured cars 
(\texttt{antforsikret}) as an offset.
\item Perform an analysis that clarifies the significance of age, engine volume, and district 
and any potential interactions between these factors. Which of the models you have 
considered seems to give the best description of the data?
\item Make some informative plots of the residuals for ``the best model'' from question c).
Are there any patterns in the plots which suggest that the model fit is not satisfactory?
\item Interpret the estimates from ``the best model'' in question c) as rate ratios, and give 
$95\%$ confidence intervals for the rate ratios.
\item Estimate the claim rate of a policyholder in the age group 25–29 years who has a 
car with engine volume 1.5–2 litres and lives in London. Also give a $95\%$ confidence
interval for this rate.
\end{enumerate}
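
As a reminder of the R syntax only, the sketch below fits a Poisson GLM with an offset to simulated data and turns the estimates into rate ratios. The data frame \texttt{toy} and its columns \texttt{a}, \texttt{b}, \texttt{exposure} and \texttt{count} are hypothetical and are not part of \texttt{claims.txt}. The Wald intervals from \texttt{confint.default} are one possible choice of interval, not a required method.

\begin{verbatim}
set.seed(1)

# Hypothetical toy data: two factors, an exposure and simulated counts
toy <- expand.grid(a = factor(1:4), b = factor(1:4))
toy$exposure <- round(runif(nrow(toy), 50, 400))
true_rate <- 0.1 * exp(0.3 * (as.integer(toy$a) - 1))
toy$count <- rpois(nrow(toy), lambda = toy$exposure * true_rate)

# log(exposure) enters with its coefficient fixed at 1,
# so the linear predictor models the log of the rate
fit <- glm(count ~ a + b + offset(log(exposure)),
           family = poisson, data = toy)

# Rate ratios with Wald 95% confidence intervals
rr <- exp(cbind(estimate = coef(fit), confint.default(fit)))
print(round(rr, 3))
\end{verbatim}

The row for the intercept is a baseline rate rather than a rate ratio; the remaining rows compare each factor level with the reference level.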


\subsubsection*{Problem 2}
The Poisson distribution has variance equal to the mean. In practice this is often 
unrealistic for count data, because the variability is greater than the Poisson 
model allows. This is called \textit{overdispersion}. A common way to handle 
overdispersed count data is to use the negative binomial distribution, which, as shown 
in the lectures, may be obtained as a mixture of Poisson distributions. In this problem 
we will consider some properties of the negative binomial distribution and the 
corresponding GLMs.

More specifically, if $\Lambda$ is a random variable that follows the gamma distribution
with pdf
\[
f(\lambda;\mu,k) = \frac{(k/\mu)^{k}}{\Gamma(k)}\lambda^{k-1}e^{-k\lambda/\mu}, \quad \lambda > 0,
\]
and the random variable $Y$, given $\Lambda=\lambda$, is Poisson distributed
with parameter $\lambda$, and thus has the conditional pmf
\[
p(y|\lambda) = \frac{\lambda^{y}}{y!}e^{-\lambda}, \quad y=0,1,2,\ldots,
\]
then the marginal pmf of $Y$ is given by
\begin{equation}
p(y;\mu,k) = \frac{\Gamma(y+k)}{\Gamma(k)\Gamma(y+1)}\left(\frac{k}{\mu+k}\right)^{k}\left(\frac{\mu}{\mu+k}\right)^{y}, \quad y=0,1,2,\ldots,
\label{eqn:pmf.negbin}
\end{equation}
which is the pmf of the negative binomial distribution. 
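
For reference, the mixture calculation behind \eqref{eqn:pmf.negbin} can be sketched as follows, using only the two densities above and the gamma integral $\int_{0}^{\infty}\lambda^{a-1}e^{-b\lambda}\,d\lambda=\Gamma(a)/b^{a}$ for $a,b>0$:
\begin{align*}
p(y;\mu,k) &= \int_{0}^{\infty} p(y|\lambda)\, f(\lambda;\mu,k)\, d\lambda \\
&= \frac{(k/\mu)^{k}}{\Gamma(k)\, y!}\int_{0}^{\infty}\lambda^{y+k-1}e^{-\lambda(\mu+k)/\mu}\, d\lambda \\
&= \frac{(k/\mu)^{k}}{\Gamma(k)\,\Gamma(y+1)}\,\Gamma(y+k)\left(\frac{\mu}{\mu+k}\right)^{y+k} \\
&= \frac{\Gamma(y+k)}{\Gamma(k)\Gamma(y+1)}\left(\frac{k}{\mu+k}\right)^{k}\left(\frac{\mu}{\mu+k}\right)^{y}.
\end{align*}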

We will now assume that $k > 0$ is a given constant, and consider the random variable
$Y^{*}=Y/k$. Then $P(Y^{*}=y^{*})=P(Y=ky^{*})$, for $y^{*}=0,\frac{1}{k},\frac{2}{k},\ldots$, so $Y^{*}$ has pmf
\begin{equation}
p(y^{*};\mu,k) = \frac{\Gamma(ky^{*}+k)}{\Gamma(k)\Gamma(ky^{*}+1)}\left(\frac{k}{\mu+k}\right)^{k}\left(\frac{\mu}{\mu+k}\right)^{ky^{*}}, \quad y^{*}=0,\frac{1}{k},\frac{2}{k},\ldots.
\label{eqn:pmf.negbin.2}
\end{equation}

\begin{enumerate}[a)]
\item Show that \eqref{eqn:pmf.negbin.2} is a distribution in the exponential dispersion 
family. That is, show that \eqref{eqn:pmf.negbin.2} can be written in the form 
$\exp((\theta y^{*}-b(\theta))/a(\phi)+c(y^{*};\phi))$, with $a(\phi) = 1/k$, and determine 
$\theta$ and $b(\theta)$.
\item Find the mean and variance of $Y^{*}$ using the relations (4.3) and (4.4) in the 
textbook. Use these results to show that $\E(Y) = \mu$ and determine $\Var(Y)$.
\end{enumerate}
Then we assume that $Y_{1},\ldots,Y_{n}$ are independent and have pmf of the form
\eqref{eqn:pmf.negbin}, and that their means $\mu_{i}=\E(Y_{i})$ are specified via a link 
function $g$, i.e. $g(\mu_{i})=\eta_{i}=\sum_{j}\beta_{j}x_{ij}$.

\begin{enumerate}[a)]\addtocounter{enumi}{2}
\item Derive an expression for the log-likelihood function 
$L(\boldsymbol{\mu}, k; \mathbf{y})$. (The textbook gives an expression for the 
log-likelihood in the parameterisation with $\gamma=1/k$. You should express it in terms 
of $k$.)
\item For a given $k > 0$, the deviance for a negative binomial GLM is given by 
$D(\mathbf{y}, \hat{\boldsymbol{\mu}}) = 2(L(\mathbf{y}, k; \mathbf{y})- L(\hat{\boldsymbol{\mu}}, k; \mathbf{y}))$. 
Derive an expression for $D(\mathbf{y}, \hat{\boldsymbol{\mu}})$.
\item Derive the limit of the deviance when $k\rightarrow \infty$. How can you explain this result?
\end{enumerate}

We will now return to the insurance data from Problem 1, where it was assumed that
the Poisson distribution was a good fit, and hence, that there was no overdispersion.
\begin{enumerate}[a)]\addtocounter{enumi}{5}
\item Fit your preferred GLM from Problem 1 c), replacing the Poisson distribution
with the negative binomial (this is done using the command \texttt{glm.nb} from the 
\texttt{MASS} package, see the R code on the horseshoe crab data from the lecture on 
October 21). Does it provide a better fit than the Poisson GLM? What does the estimated
$k$ (called $\theta$ in the R output) tell you about possible overdispersion, and how
do you see that in light of your response to Problem 1 a)?
\end{enumerate}
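
As a syntax illustration only, the sketch below compares a Poisson GLM and a negative binomial GLM on simulated overdispersed counts. It is not the lecture code. The variables \texttt{x}, \texttt{exposure} and \texttt{y} are hypothetical, and the simulation parameters (for instance \texttt{size = 5}) are arbitrary assumptions.

\begin{verbatim}
library(MASS)  # provides glm.nb
set.seed(2)

# Hypothetical simulated data with negative binomial counts
n <- 200
x <- factor(rep(1:4, length.out = n))
exposure <- runif(n, 50, 500)
mu <- exposure * 0.1 * exp(0.2 * (as.integer(x) - 1))
y <- rnbinom(n, size = 5, mu = mu)

# Same linear predictor and offset in both models
fit_pois <- glm(y ~ x + offset(log(exposure)), family = poisson)
fit_nb <- glm.nb(y ~ x + offset(log(exposure)))

# Estimated k (reported as theta) with its standard error
print(c(theta = fit_nb$theta, se = fit_nb$SE.theta))

# One possible way to compare the two fits
print(AIC(fit_pois, fit_nb))
\end{verbatim}

A large estimated \texttt{theta} relative to the fitted means would indicate a fit close to the Poisson model; in this simulation the counts are generated with clear overdispersion, so the negative binomial fit should be preferred.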


\end{document}