\documentclass[10pt,UKenglish]{article}
\RequirePackage{amsthm,amsmath,amsfonts}
\RequirePackage{bm}
\usepackage{enumerate}
%\usepackage{enumitem}
\usepackage{paralist}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{color}
\newcommand{\E}{\mbox{E}}
\newcommand{\sd}{\mbox{sd}}
\newcommand{\Var}{\mbox{Var}}
\usepackage{url}
\usepackage{parskip}
\newcommand{\B}{\boldsymbol}
\newcommand{\Bb}{\mathbf}

\begin{document}

\begin{center}
\section*{STK3100/4100 -- Introduction to Generalized Linear Models}
\subsection*{Mandatory assignment 2 of 2}
\end{center}

\subsubsection*{Submission deadline}
Thursday November 7 2024, 14:30 in Canvas (\url{canvas.uio.no}).

\subsubsection*{Instructions}
Note that you have \textbf{one attempt} to pass the assignment. This means that there are no second attempts.

You can choose between scanning handwritten notes or typing the solution directly on a computer (for instance with \LaTeX). The assignment must be submitted as \textbf{a single PDF file}. Scanned pages must be clearly legible. The submission must contain your name, course and assignment number.

It is expected that you give a clear presentation with all necessary explanations. Remember to include all relevant plots and figures. All aids, including collaboration, are allowed, but the submission must be written by you and reflect your understanding of the subject. If we doubt that you have understood the content you have handed in, we may request that you give an oral account.

In exercises where you are asked to write a computer program, you need to hand in the code along with the rest of the assignment. It is important that the submitted program contains a trial run, so that it is easy to see the result of the code.
\subsubsection*{Application for postponed delivery}
If you need to apply for a postponement of the submission deadline due to illness or other reasons, you have to contact the Student Administration at the Department of Mathematics (e-mail: \href{mailto:studieinfo@math.uio.no}{studieinfo@math.uio.no}) no later than the same day as the deadline.

All mandatory assignments in this course must be approved in the same semester, before you are allowed to take the final examination.

\subsubsection*{Specifically about this assignment}
In order to get the assignment accepted you need to fulfil the following requirements:
\begin{itemize}
\item Make a real attempt at all (sub-)questions.
\item Give satisfactory answers to at least 60\% of the (sub-)questions.
\item Include relevant R output in your report.
\end{itemize}

\subsubsection*{Complete guidelines about delivery of mandatory assignments:}
\url{www.uio.no/english/studies/admin/compulsary-activities/mn-math-mandatory.html}

\begin{center}
GOOD LUCK!
\end{center}

\subsubsection*{Problem 1}
In this problem we will look at data on the number of claims during one year in a portfolio of insured cars from an English insurance company. The number of claims is registered according to the age of the insured (given in four age groups), the engine volume of the car (given in four volume groups), and the district where the car is insured (four districts).
You may read the data into R with the commands:
\begin{verbatim}
data <- "http://www.uio.no/studier/emner/matnat/math/STK3100/data/claims.txt"
claims <- read.table(data, header = TRUE)
\end{verbatim}
The data file contains one line for each of the 64 combinations of age group, volume group and district, with the following five variables (columns):
\begin{itemize}
\item \texttt{alder}: age of the policyholder (1 = below 25 years; 2 = 25--29 years; 3 = 30--35 years; 4 = over 35 years).
\item \texttt{motorvolum}: engine volume (1 = below 1 litre; 2 = 1--1.5 litres; 3 = 1.5--2 litres; 4 = over 2 litres).
\item \texttt{distrikt}: district (4 = London and other large cities; 1--3 = other districts).
\item \texttt{antforsikret}: number of insured cars.
\item \texttt{antskader}: number of claims.
\end{itemize}
We will assume that the number of claims (\texttt{antskader}) is Poisson distributed within each of the 64 combinations of age group, volume group and district.
\begin{enumerate}[a)]
\item Explain why this may be a reasonable assumption.
\end{enumerate}
We will use a GLM for Poisson data with logarithmic link function and age of the policyholder (\texttt{alder}), engine volume (\texttt{motorvolum}) and district (\texttt{distrikt}) as categorical covariates (factors).
\begin{enumerate}[a)]\addtocounter{enumi}{1}
\item Explain why you should use the logarithm of the number of insured cars (\texttt{antforsikret}) as an offset.
\item Perform an analysis that clarifies the significance of age, engine volume, and district and any potential interactions between these factors. Which of the models you have considered seems to give the best description of the data?
\item Make some informative plots of the residuals for ``the best model'' from question c). Are there any patterns in the plots which suggest that the model fit is not satisfactory?
\item Interpret the estimates from ``the best model'' in question c) as rate ratios, and give $95\%$ confidence intervals for the rate ratios.
\item Estimate the claim rate of an insured person in the age category 25--29 years who has a car with engine volume 1.5--2 litres and lives in London (district 4). Also give a $95\%$ confidence interval for this rate.
\end{enumerate}

\subsubsection*{Problem 2}
The Poisson distribution has variance equal to the mean. In practice this assumption is often unrealistic for count data, because the variability is in fact greater than can be described by the Poisson mean. This is what we call \textit{overdispersion}. A common way to handle overdispersed count data is to use a type of mixture of Poisson distributions, which results in the negative binomial distribution. In this problem we will consider some properties of the negative binomial distribution and the corresponding GLMs.

As shown in the lectures, the negative binomial distribution may be obtained as a mixture of Poisson distributions. More specifically, if $\Lambda$ is a random variable that follows the gamma distribution with pdf
\[
f(\lambda;\mu,k) = \frac{(k/\mu)^{k}}{\Gamma(k)}\lambda^{k-1}e^{-k\lambda/\mu}, \quad \lambda > 0,
\]
and the random variable $Y$, given $\Lambda=\lambda$, is Poisson distributed with parameter $\lambda$, and thus has the conditional pmf
\[
p(y \mid \lambda) = \frac{\lambda^{y}}{y!}e^{-\lambda}, \quad y=0,1,2,\ldots,
\]
then the marginal pmf of $Y$ is given by
\begin{equation}
p(y;\mu,k) = \frac{\Gamma(y+k)}{\Gamma(k)\Gamma(y+1)}\left(\frac{k}{\mu+k}\right)^{k}\left(\frac{\mu}{\mu+k}\right)^{y}, \quad y=0,1,2,\ldots,
\label{eqn:pmf.negbin}
\end{equation}
which is the pmf of the negative binomial distribution.

We will now assume that $k > 0$ is a given constant, and consider the random variable $Y^{*}=Y/k$. Then $P(Y^{*}=y^{*})=P(Y=ky^{*})$ for $y^{*}=0,\frac{1}{k},\frac{2}{k},\ldots$, so $Y^{*}$ has pmf
\begin{equation}
p(y^{*};\mu,k) = \frac{\Gamma(ky^{*}+k)}{\Gamma(k)\Gamma(ky^{*}+1)}\left(\frac{k}{\mu+k}\right)^{k}\left(\frac{\mu}{\mu+k}\right)^{ky^{*}}, \quad y^{*}=0,\frac{1}{k},\frac{2}{k},\ldots.
\label{eqn:pmf.negbin.2}
\end{equation}
\begin{enumerate}[a)]
\item Show that \eqref{eqn:pmf.negbin.2} is a distribution in the exponential dispersion family. That is, show that \eqref{eqn:pmf.negbin.2} can be written in the form $\exp((\theta y^{*}-b(\theta))/a(\phi)+c(y^{*};\phi))$, with $a(\phi) = 1/k$, and determine $\theta$ and $b(\theta)$.
\item Find the mean and variance of $Y^{*}$ using the relations (4.3) and (4.4) in the textbook. Use these results to show that $\E(Y) = \mu$ and determine $\Var(Y)$.
\end{enumerate}
Next, we assume that $Y_{1},\ldots,Y_{n}$ are independent and have pmfs of the form \eqref{eqn:pmf.negbin}, and that their means $\mu_{i}=\E(Y_{i})$ are specified via a link function $g$, i.e.\ $g(\mu_{i})=\eta_{i}=\sum_{j}\beta_{j}x_{ij}$.
\begin{enumerate}[a)]\addtocounter{enumi}{2}
\item Derive an expression for the log-likelihood function $L(\boldsymbol{\mu}, k; \mathbf{y})$. (In the textbook, there is an expression for the log-likelihood in the parameterisation with $\gamma=1/k$. You should express it in terms of $k$.)
\item For a given $k > 0$, the deviance for a negative binomial GLM is given by $D(\mathbf{y}, \hat{\boldsymbol{\mu}}) = 2(L(\mathbf{y}, k; \mathbf{y})- L(\hat{\boldsymbol{\mu}}, k; \mathbf{y}))$. Derive an expression for $D(\mathbf{y}, \hat{\boldsymbol{\mu}})$.
\item Derive the limit of the deviance when $k\rightarrow \infty$. How can you explain this result?
\end{enumerate}
We will now return to the insurance data from Problem 1, where it was assumed that the Poisson distribution was a good fit, and hence that there was no overdispersion.
\begin{enumerate}[a)]\addtocounter{enumi}{5}
\item Fit your preferred GLM from Problem 1 c), substituting the negative binomial distribution for the Poisson distribution (this is done using the command \texttt{glm.nb} from the \texttt{MASS} package, see the R code on the horseshoe crab data from the lecture on October 21). Does it provide a better fit than the Poisson GLM?
What does the estimated $k$ (called $\theta$ in the R output) tell you about possible overdispersion, and how does this relate to your answer to Problem 1 a)?
\end{enumerate}
\end{document}