---
title: "Exercises day 7"
author: "Alfonso Diz-Lois"
format: pdf
editor: visual
---

## Exercise 18: a second look at logistic regression

In this exercise we will use data from the Western Collaborative Group Study (WCGS), a large epidemiological study designed to study risk factors for coronary heart disease (CHD). More than 3000 men were recruited to the study, and a number of (potential) risk factors were recorded at entry. The men were then followed for about ten years and it was recorded if they developed CHD or not. The WCGS data are used in many of the examples in Section 3.4 and Chapter 6 of the course text book byVittinghoff et al.

You may read the WCGS data into R by the command:

```{r}
wcgs=read.table("http://www.uio.no/studier/emner/matnat/math/STK4900/data/wcgs.txt",sep="\t", header=T,na.strings=".")
```

In this exercise we will restrict ourselves to study the effect of smoking and age on the risk for CHD.

### a

We first fit a logistic regression model with smoke as covariate:

```{r}
fit.smoke=glm(chd69~smoke,data=wcgs,family=binomial)
summary(fit.smoke)
```

Is there a significant effect of smoking on the risk of developing CHD? (Cf. slide 11) Yes

Compute a 95% confidence interval for the regression coefficient for smoking (cf slide 9)

```{r}
lci=fit.smoke$coefficients[2]+qnorm(0.025)*summary(fit.smoke)$coefficients[2,2]
uci=fit.smoke$coefficients[2]+qnorm(0.975)*summary(fit.smoke)$coefficients[2,2]
print(lci)
print(uci)
#Assumed normality
confint.default(fit.smoke)

```

Also estimate the odds ratio for smoking and determine a 95% confidence interval for the odds ratio (cf slide 9).

```{r}

#Odds ratio for smoking (odds ratio for smoking):
paste0(round(exp(fit.smoke$coefficients[2]),3)," (",
        round(exp(lci),3)," - ",round(exp(uci),3))

```

### b

In order to estimate the odds ratio with 95% confidence limits from a fitted logistic model, you may use the function (cf slide 10)

```{r}
expcoef=function(glmobj)

{
regtab=summary(glmobj)$coef
expcoef=exp(regtab[,1])
lower=expcoef*exp(-1.96*regtab[,2])
upper=expcoef*exp(1.96*regtab[,2])
cbind(expcoef,lower,upper)
}
```

Use this function to estimate the odds ratio for smoking with 95% confidence limits. Check that you get the same results as in question a.

```{r}
print(expcoef(fit.smoke))
```

### c

We then use logistic regression to study the effect of age (at entry to the study) for the risk of developing CHD:

```{r}
fit.age=glm(chd69~age,data=wcgs,family=binomial)
summary(fit.age)

```

Use the expcoef-function to estimate the odds ratio for a one-year increase in age with 95% confidence interval.

```{r}
expcoef(fit.age)

```

### d

When interpreting the effect of age, it may be reasonable to give the odds ratio corresponding to a ten-year increase in age (rather than a one-year increase as we did in question c). The easiest way to achieve this is to fit the logistic model using age/10 as covariate:

```{r}
fit.age10=glm(chd69~I(age/10),data=wcgs,family=binomial)
summary(fit.age10)
```

Compare the estimates and standard errors of this model with the one in question c. What do you see? That $/beta_1$ is now 10 times larger. As expected we got the same value but scaled to the new units.

You may then use the expcoef-function to estimate the odds ratio for a ten-year increase in age with 95% confidence interval. (Why?)

Because each 1 unit increase ($\Delta$ in the slides) corresponds to 10 years now.

```{r}
expcoef(fit.age10)

```

Which is equivalent to the previous case using $\Delta=10$:

```{r}
print(round(exp(fit.age$coefficients[2]*10),5))
```

## Exercise 19: logistic regression with several predictors

In this exercise we will use data from the Western Collaborative Group Study (WCGS), a large epidemiological study designed to study risk factors for coronary heart disease (CHD). More than 3000 men were recruited to the study, and a number of (potential) risk factors were recorded at entry. The men were then followed for about ten years and it was recorded if they developed CHD or not. The WCGS data are used in many of the examples in Section 3.4 and Chapter 6 of the course text book by Vittinghoff et al.

You may read the WCGS data into R by the command:

```{r}
wcgs=read.table("http://www.uio.no/studier/emner/matnat/math/STK4900/data/wcgs.txt",sep="\t", header=T,na.strings=".")
```

### a

Our starting point in this exercise is the model for CHD risk using the predictors age (per 10 years), cholesterol (per 50 mg/dL), systolic blood pressure (per 50 mmHg), body mass index (per 10 kg/m2), smoking (yes, no), and behavioral pattern (with the four groups 1=A1, 2=A2, 3=B3, 4=B4).

You may fit this model by the commands on slide 19:

```{r}
#Remember to set as a factor given that we have 4 groups
wcgs$behcat=factor(wcgs$behpat)
wcgs.beh=glm(chd69~age_10+chol_50+sbp_50+bmi_10+smoke+behcat, data=wcgs, family=binomial, subset=(chol<600))
summary(wcgs.beh)

```

Perform the commands and check that you get the results reported on slide 19. Comment on the results. In particular discuss the effects of the behavioral patterns

### b

In question a we consider a model with four behavioral groups. One may wonder if it is sufficient to consider only two behavioral groups (A-behavior and B-behavior). To this end we fit a model with the binary covariate dibpat (which is coded as 0 for B3 and B4, and as 1 for A1 and A2):

```{r}
wcgs.beh2=glm(chd69~age_10+chol_50+sbp_50+bmi_10+smoke+dibpat, data=wcgs, family=binomial, subset=(chol<600))
summary(wcgs.beh2)
```

To test the null hypothesis that the effects behavioral patterns A1 and A2 are the same and the effects behavioral patterns B1 and B2 are the same, we compare the (residual) deviances for the model in a and the model considered here:

```{r}
anova(wcgs.beh2,wcgs.beh, test="Chisq")
```

Perform the command and explain what the output tells you (cf slide 26). What do you conclude from the test?

There is no significant differences.

### c

We then fit a model without behavioral pattern and compare the three models in an analysis of deviance table:

```{r}
wcgs.resc=glm(chd69~age_10+chol_50+bmi_10+sbp_50+smoke, data=wcgs, family=binomial, subset=(chol<600))
anova(wcgs.resc,wcgs.beh2,wcgs.beh, test="Chisq")
#example: pvalue of third model =(1-(pchisq(0.232,df=2)))
```

Perform the command and explain what you may conclude from the analysis of deviance table?

Model 2 is significantly better than model 1, but the differences between model 2 and 3 are not significant.

### d

Which of the models in question a, b, and c would you prefer?

Use the expcoef-command from Exercise 18 to obtain the estimated odds ratios with 95% confidence intervals for your preferred model.

```{r}
expcoef(wcgs.beh2)
```

Discuss what these odds ratios tell you about the risk factors for CHD.

### Exercise 20

In this exercise we will study data from an experiment where one wanted to assess the toxicity of the substance rotenone. Groups of about 50 insects were exposed to various doses of rotenone, and the number of insects that died at each dose level was recorded. The data are available at the course web-page. The variables in the data set are coded as follows: - LOGDOSE Logarithm of the concentration of rotenone (base 10 logarithm) - NUMBER Number of insects in total - DEAD Number of insects that died

```{r}
# Clean up the memory before we start.
rm(list=ls(all=TRUE))
# Read data.
insects.data = read.table(file="http://www.uio.no/studier/emner/matnat/math/STK4900/data/insects.txt", header=TRUE)
# Take a look at the data. 
sum(is.na(insects.data))
head(insects.data)

```

### a

Compute the proportion of insects that died at each dose level and make a plot of the proportions versus dose of rotenone.

```{r}
insects.data[,"proportion"] = insects.data[,"DEAD"]/insects.data[,"NUMBER"]
# Draw a scatter plot between "LOGDOSE" and "proportion"
# Open a pdf device.
#pdf("./Exercise_20_a.pdf", width = 6, height = 6)
# Scatter plot
plot(x=insects.data[,"LOGDOSE"], y=insects.data[,"proportion"], ylim=c(0,1), pch=16,xlab="LogDose",ylab="Prop.")
# Turn off the device.
#dev.off()

```

### b

Fit a suitable regression model for the relation between the proportions of insects that died and the doses of rotenone. Give the reasons for your choice of regression model and interpret the fitted model.

Fit a logistic regression model with logdose as covariate and look at the result: (Note that since we have grouped data, not binary, the response has to be specified as cbind(y,n-y) where n=number of individuals in a group and y=number of "successes" in the group.)

(There are two ways to fit a GLM with R: either the oucome is a vector of 0/1 or a factor with two levels; or a matrix with two columns of counts for success/failure, as in this case)

```{r}
glm.obj = glm(cbind(DEAD,NUMBER-DEAD)~LOGDOSE, data=insects.data, family=binomial)
summary(glm.obj)
```

```{r}
new.data.frame=c()
for (i in 1:nrow(insects.data)){
  for (j in 1:insects.data$DEAD[i]){
    new.data.frame=rbind(new.data.frame,c(insects.data$LOGDOSE[i],1))
  } 
  for (j in 1:(insects.data$NUMBER[i]-insects.data$DEAD[i])){
    new.data.frame=rbind(new.data.frame,c(insects.data$LOGDOSE[i],0))
  }
}
new.data.frame=data.frame(new.data.frame)
colnames(new.data.frame)<-c("LOGDOSE","y")
summary(glm(y~LOGDOSE,data=new.data.frame,family=binomial))

```

The deviance overall is different because the Bernouilli approach focus on variability at the individual level and the number of experiments is of course larger , whereas the Binomial approach groups the data and assumes a much smaller number of samples.

### c

Assess the fit of the model by including the fitted proportions on the plot from question a. Also give a formal goodness-of-fit test.

```{r}

# grid over x axis to visualize predicted values from the model.
x.grid = seq(from=0.3, to=1.2, by=0.01)
# Compute the probabilities obtained from the fitted model.
y.hat = predict(object=glm.obj, newdata=data.frame(LOGDOSE=x.grid), type="response")

# Draw a scatter plot between "LOGDOSE" and "proportion"
# Open a pdf device.
#pdf("./Exercise_20_c.pdf", width = 6, height = 6)
# Scatter plot
plot(x=insects.data[,"LOGDOSE"], y=insects.data[,"proportion"], ylim=c(0,1), pch=16)
# Visualize the fitted model.
points(x=x.grid, y=y.hat, type="l", col="blue")
# Turn off the device.
#dev.off()
```

Goodness of fit test. Checking the deviance as above:

```{r}
#Reference model where p=cte
raw=glm(formula=cbind(DEAD,NUMBER-DEAD)~1,family=binomial,data=insects.data)
print(summary(raw))
#Compare with the model:
anova(raw,glm.obj,test="Chisq")
```

Remember the F test in the linear model framework.

### d

Use the fitted model to estimate LD50, i.e. the dose required to kill half the members of a tested population.

```{r}
# Use the formula from slide 24, Lecture 6.
# Extract coefficients from the model.
beta.0.hat = glm.obj$coefficients[1]
names(beta.0.hat) = NULL
beta.1.hat = glm.obj$coefficients[2]
names(beta.1.hat) = NULL
# Target probability
p = 0.5
# Convert p into LD50.
LD50 = (log(p/(1-p)) - beta.0.hat)/beta.1.hat
print(paste0("LD50 :",round(LD50,3)))

# Check whether the solution is correct.
predict(object=glm.obj, newdata=data.frame(LOGDOSE=LD50), type="response")
#Graphically:
plot(x=insects.data[,"LOGDOSE"], y=insects.data[,"proportion"], ylim=c(0,1), pch=16)
# Visualize the fitted model.
points(x=x.grid, y=y.hat, type="l", col="blue")
abline(h=0.5,col="green",lty=3)

```