---
title: "Exercises day 8"
author: "Alfonso Diz-Lois"
format: pdf
editor: visual
---

## Exercise 21: Fitting Poisson models

For each of the three groups fit Poisson model, compute "expected" tables and the coefficient of dispersion.

```{r}

# Clean up the memory before we start.
rm(list=ls(all=TRUE))

# The following function may be used to calculate quantities mentioned in 
# Slide 8 - 14, Lecture 8
poisson.expected.table.func = function(observed.frequency) {
  
  # Number of total observations.
  n = sum(observed.frequency)
  # Frequency.
  y = 0:(length(observed.frequency)-1)
  
  # y.bar, as defined in Slide 8, Lecture 8.
  mean.y = sum(y*observed.frequency)/n
  # s^2, as defined in Slide 8, Lecture 8.
  var.y = ((n-1)^(-1))*sum(observed.frequency*(y-mean.y)^2)
  # Coefficient of dispersion (CD), as defined in Slide 8, Lecture 8.
  CD = var.y/mean.y
  # Degrees of freedom, as defined in Slide 9, Lecture 8.
  df = length(observed.frequency) - 2
  
  # Poisson denisty, as defined in Slide 3, Lecture 8.
  prob.y = dpois(x=0:df, lambda=mean.y)
  
  # Note that "1 - ppois(q=df, lambda=mean.y)" and "1 - sum(prob.y)" should give
  # exact the same result. Why?
  
  # Aggregate the frequencies that are equal to or higher than
  # the highest observed number of frequency, as mentioned in Slide 10, Slide 11, Lecture 8.
  prob.y = c(prob.y, 1 - ppois(q=df, lambda=mean.y))
  # Alternative:
  #prob.y = c(prob.y, 1 - sum(prob.y))
  
  # Compute the expected number of frequency (E_j) by using the definition on Slide 10, Lecture 8.
  # (Recognize Poisson density from Slide 3, Lecture 10.)
  expected.frequency = n*prob.y
  
  # Summarize the result into a matrix.
  result.matrix = cbind(
    # Frequency.
    y,
    # Observed number of frequency.
    observed.frequency,
    # Expected number of frequency.
    expected.frequency,
    # Difference between observed number of frequency and expected number of frequency.
    observed.frequency-expected.frequency,
    # Relative difference between observed number of frequency and expected number of frequency.
    # This is the term inside the summation part of Pearson chi-squared statistic, in Slide 10, Lecture 8.
    (observed.frequency-expected.frequency)^2/expected.frequency
  )
  result.matrix = as.data.frame(result.matrix)
  # Give proper names to the columns.
  colnames(result.matrix) = c("y","Ob.freq","Ex.freq","O-E","(O-E)^2/E")
  # Round off some decimals
  result.matrix = round(x=result.matrix, digits=2)
  
  # Compute Pearson chi-squared statistic, as defined in Slide 10, Lecture 8.
  Chi.sq = sum(result.matrix[,"(O-E)^2/E"])
  # Compute the p-value that belongs to the Pearson chi-squared statistic.
  p.val = 1 - pchisq(q=Chi.sq, df=df)
  # what does "pchisq()" do?
  # Hint: Use "?pchisq"
  # Why should we use "1 - pchisq(q=Chi.sq, df=df)" and not "pchisq(q=Chi.sq, df=df)"?
  
  # Print some results.
  cat("mean.y = ", mean.y, "\n", sep="")
  cat("var.y = ", var.y, "\n", sep="")
  cat("CD = ", CD, "\n", sep="")
  cat("Chi.sq = ", Chi.sq, "\n", sep="")
  cat("p-value = ", p.val, "\n", sep="")
  cat("\n")
  
  # Return the result matrix.
  return(result.matrix)
}


# Use the function:
yeast = c(75,103,121,54,30,17)
poisson.expected.table.func(observed.frequency=yeast)

```

Aggregating the last categories:

```{r}

moss<-c(100,9,6,11)
poisson.expected.table.func(observed.frequency=moss)

azuki<-c(61,51)
poisson.expected.table.func(observed.frequency=azuki)
```

## Exercise 22: Poisson regression – cancer data

The dataset cancer gives lung cancer cases and person-years at risk for British physicians. The data are from a study of lung cancer incidence as a function of cigarette smoking, and we will use it to illustrate Poisson regression.

### a

Perform Poisson regression on the cancer data with number of cases being the response variable and years of cigarette smoking and number of cigarettes smoked per day being covariates

```{r}
# Clean up the memory before we start.
rm(list=ls(all=TRUE))

# Read data.
cancer.data = read.table(file="http://www.uio.no/studier/emner/matnat/math/STK4900/data/cancer.txt", header=FALSE)
names(cancer.data)=c("age","cig","pyr","cancer")
# Take a look at the data.
sum(is.na(cancer.data))
head(cancer.data)
```

```{r}
# We first consider the model  E(Y) = n*exp{b0+b1*s+b2*a}, where
#     Y = number of cancer cases (= cancer),
#     n = number of person years (= pyr),
#     s = number of cigarettes smoked per day (= cig)
#     a = age in years (= age)
# We may write the model on the form  E(Y)= exp{1*log(n)+b0+b1*s+b2*a}.

# Note that log(n) appears as a sort of "covariate" where we know that the regression coefficient takes the value 1.
# This is called an OFFSET.

# We fit the model and look at the result.
glm.obj.1 = glm(cancer~offset(log(pyr))+age+cig, data=cancer.data, family=poisson)
summary(glm.obj.1)
```

It is common to report the results of a Poisson regression by means of the rate ratio RR = exp(beta) with confidence limits. To this end we may use the function "exp.coef.func()" from Exercise 18. Use the function to compute rate ratios for age and number of cigarettes:

```{r}
exp.coef.func = function(glm.obj) {
  
  alpha = 0.05
  coef.mat = summary(glm.obj)$coef
  
  lower = exp(coef.mat[,1] - qnorm(p=1-alpha/2)*coef.mat[,2])
  upper = exp(coef.mat[,1] + qnorm(p=1-alpha/2)*coef.mat[,2])
  result = cbind(estimate=exp(coef.mat[,1]), lower, upper)
  
  return(result)
}

exp.coef.func(glm.obj.1)

```

### c

We then look at a model with second order terms and interaction:

```{r}
glm.obj.2 = glm(cancer~offset(log(pyr))+age+I(age^2)+cig+I(cig^2)+age:cig, data=cancer.data, family=poisson)
```

Reduce the model by (step-wise) eliminating non-significant covariates. (Use Wald tests from the summary-command and/or deviances from the anova-command.) Discuss the interpretation of your "final model".

ADDITIONAL QUESTION: Age and the number of cigarettes smoked are reported in intervals. We may alternatively consider these covariates as categorical. Such a model is fitted by the command:

```{r}
glm.obj.3 = glm(cancer~offset(log(pyr))+factor(age)+factor(cig), data=cancer.data, family=poisson)
summary(glm.obj.3)
# Give an interpretation of this model.
# Discuss how the model may be used to assess the fit of your "final model" from question 3.
```

One way to compare both models is to use the AIC, which happens to be lower for the categorical model.