--- title: "Exercises day 8" author: "Alfonso Diz-Lois" format: pdf editor: visual --- ## Exercise 21: Fitting Poisson models For each of the three groups fit Poisson model, compute "expected" tables and the coefficient of dispersion. ```{r} # Clean up the memory before we start. rm(list=ls(all=TRUE)) # The following function may be used to calculate quantities mentioned in # Slide 8 - 14, Lecture 8 poisson.expected.table.func = function(observed.frequency) { # Number of total observations. n = sum(observed.frequency) # Frequency. y = 0:(length(observed.frequency)-1) # y.bar, as defined in Slide 8, Lecture 8. mean.y = sum(y*observed.frequency)/n # s^2, as defined in Slide 8, Lecture 8. var.y = ((n-1)^(-1))*sum(observed.frequency*(y-mean.y)^2) # Coefficient of dispersion (CD), as defined in Slide 8, Lecture 8. CD = var.y/mean.y # Degrees of freedom, as defined in Slide 9, Lecture 8. df = length(observed.frequency) - 2 # Poisson denisty, as defined in Slide 3, Lecture 8. prob.y = dpois(x=0:df, lambda=mean.y) # Note that "1 - ppois(q=df, lambda=mean.y)" and "1 - sum(prob.y)" should give # exact the same result. Why? # Aggregate the frequencies that are equal to or higher than # the highest observed number of frequency, as mentioned in Slide 10, Slide 11, Lecture 8. prob.y = c(prob.y, 1 - ppois(q=df, lambda=mean.y)) # Alternative: #prob.y = c(prob.y, 1 - sum(prob.y)) # Compute the expected number of frequency (E_j) by using the definition on Slide 10, Lecture 8. # (Recognize Poisson density from Slide 3, Lecture 10.) expected.frequency = n*prob.y # Summarize the result into a matrix. result.matrix = cbind( # Frequency. y, # Observed number of frequency. observed.frequency, # Expected number of frequency. expected.frequency, # Difference between observed number of frequency and expected number of frequency. observed.frequency-expected.frequency, # Relative difference between observed number of frequency and expected number of frequency. # This is the term inside the summation part of Pearson chi-squared statistic, in Slide 10, Lecture 8. (observed.frequency-expected.frequency)^2/expected.frequency ) result.matrix = as.data.frame(result.matrix) # Give proper names to the columns. colnames(result.matrix) = c("y","Ob.freq","Ex.freq","O-E","(O-E)^2/E") # Round off some decimals result.matrix = round(x=result.matrix, digits=2) # Compute Pearson chi-squared statistic, as defined in Slide 10, Lecture 8. Chi.sq = sum(result.matrix[,"(O-E)^2/E"]) # Compute the p-value that belongs to the Pearson chi-squared statistic. p.val = 1 - pchisq(q=Chi.sq, df=df) # what does "pchisq()" do? # Hint: Use "?pchisq" # Why should we use "1 - pchisq(q=Chi.sq, df=df)" and not "pchisq(q=Chi.sq, df=df)"? # Print some results. cat("mean.y = ", mean.y, "\n", sep="") cat("var.y = ", var.y, "\n", sep="") cat("CD = ", CD, "\n", sep="") cat("Chi.sq = ", Chi.sq, "\n", sep="") cat("p-value = ", p.val, "\n", sep="") cat("\n") # Return the result matrix. return(result.matrix) } # Use the function: yeast = c(75,103,121,54,30,17) poisson.expected.table.func(observed.frequency=yeast) ``` Aggregating the last categories: ```{r} moss<-c(100,9,6,11) poisson.expected.table.func(observed.frequency=moss) azuki<-c(61,51) poisson.expected.table.func(observed.frequency=azuki) ``` ## Exercise 22: Poisson regression – cancer data The dataset cancer gives lung cancer cases and person-years at risk for British physicians. 
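The function above judges the fit with a Pearson chi-squared test. The coefficient of dispersion can also be used directly: for i.i.d. Poisson data, (n-1)\*CD is approximately chi-squared distributed with n-1 degrees of freedom (Fisher's dispersion test). Below is a minimal sketch for the yeast data; the test and the two-sided p-value construction are standard, but they are an addition here and not taken from the lecture slides.

```{r}
# Sketch: Fisher's dispersion test for the yeast data.
# Assumption: under H0 (Poisson), (n-1)*CD is approximately chi-squared
# distributed with n-1 degrees of freedom.
yeast = c(75,103,121,54,30,17)
n = sum(yeast)
y = 0:(length(yeast)-1)
mean.y = sum(y*yeast)/n
var.y = sum(yeast*(y-mean.y)^2)/(n-1)
CD = var.y/mean.y
disp.stat = (n-1)*CD
# Two-sided p-value: both over- and underdispersion speak against the
# Poisson model.
p.val = 2*min(pchisq(q=disp.stat, df=n-1),
              1 - pchisq(q=disp.stat, df=n-1))
round(c(CD=CD, statistic=disp.stat, p.value=p.val), digits=4)
```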
## Exercise 22: Poisson regression – cancer data

The dataset `cancer` gives lung cancer cases and person-years at risk for British physicians. The data are from a study of lung cancer incidence as a function of cigarette smoking, and we will use them to illustrate Poisson regression.

### a

Perform a Poisson regression on the cancer data, with the number of cases as the response variable and with years of cigarette smoking and number of cigarettes smoked per day as covariates.

```{r}
# Clean up the memory before we start.
rm(list=ls(all=TRUE))

# Read data.
cancer.data = read.table(file="http://www.uio.no/studier/emner/matnat/math/STK4900/data/cancer.txt", header=FALSE)
names(cancer.data) = c("age","cig","pyr","cancer")

# Take a look at the data.
sum(is.na(cancer.data))
head(cancer.data)
```

```{r}
# We first consider the model E(Y) = n*exp{b0+b1*s+b2*a}, where
#   Y = number of cancer cases (= cancer),
#   n = number of person-years (= pyr),
#   s = number of cigarettes smoked per day (= cig),
#   a = age in years (= age).
# We may write the model in the form E(Y) = exp{1*log(n)+b0+b1*s+b2*a}.
# Note that log(n) appears as a sort of "covariate" where we know that the
# regression coefficient takes the value 1. This is called an OFFSET.

# We fit the model and look at the result.
glm.obj.1 = glm(cancer~offset(log(pyr))+age+cig, data=cancer.data, family=poisson)
summary(glm.obj.1)
```

### b

It is common to report the results of a Poisson regression by means of the rate ratio RR = exp(beta) with confidence limits. To this end we may use the function "exp.coef.func()" from Exercise 18. Use the function to compute rate ratios for age and number of cigarettes:

```{r}
# Rate ratios exp(beta) with 95% confidence limits based on the
# normal approximation.
exp.coef.func = function(glm.obj)
{
  alpha = 0.05
  coef.mat = summary(glm.obj)$coef
  lower = exp(coef.mat[,1] - qnorm(p=1-alpha/2)*coef.mat[,2])
  upper = exp(coef.mat[,1] + qnorm(p=1-alpha/2)*coef.mat[,2])
  result = cbind(estimate=exp(coef.mat[,1]), lower, upper)
  return(result)
}

exp.coef.func(glm.obj.1)
```

### c

We then look at a model with second-order terms and interaction:

```{r}
glm.obj.2 = glm(cancer~offset(log(pyr))+age+I(age^2)+cig+I(cig^2)+age:cig, data=cancer.data, family=poisson)
```

Reduce the model by (step-wise) eliminating non-significant covariates. (Use Wald tests from the summary command and/or deviances from the anova command.) Discuss the interpretation of your "final model".

ADDITIONAL QUESTION: Age and the number of cigarettes smoked are reported in intervals. We may alternatively consider these covariates as categorical. Such a model is fitted by the command:

```{r}
glm.obj.3 = glm(cancer~offset(log(pyr))+factor(age)+factor(cig), data=cancer.data, family=poisson)
summary(glm.obj.3)

# Give an interpretation of this model.
# Discuss how the model may be used to assess the fit of your "final model"
# from question c.
```

One way to compare the two models is the AIC, which here turns out to be lower for the categorical model.
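A minimal sketch of that comparison follows. Here `glm.obj.1` stands in for the "final model" (an assumption; your own reduction in question c may end with a different model):

```{r}
# AIC comparison of the fitted models (lower is better).
AIC(glm.obj.1, glm.obj.2, glm.obj.3)

# If the final model contains no age:cig interaction, it is nested within the
# categorical model (factor(age) and factor(cig) span any function of the
# tabulated age and cig values), so a deviance test is also available:
anova(glm.obj.1, glm.obj.3, test="Chisq")
```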