---
title: "Day 4 exercises"
author: "Alfonso Diz-Lois"
date: "2025-02-14"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Exercise 11:

```{r}
# Clean up the memory before we start.
rm(list=ls(all=TRUE))

# Check and change working directory.
#getwd()
#setwd("C:/Users/username/Documents")

# In this exercise we will use data from the Heart and Estrogen/Progestin Study (HERS), a clinical trial of
# hormone therapy for prevention of recurrent heart attacks and death among post-menopausal women with
# existing coronary heart disease.
# The aim of this exercise is to study how the change in low-density lipoprotein (LDL) cholesterol over the
# first year of the HERS study depends on the baseline value of LDL (i.e. the value at entry to the study)
# and on the use of hormone therapy (HT).

# Read data.
HERS.data = read.table("http://www.uio.no/studier/emner/matnat/math/STK4900/data/hers.txt", header=TRUE, sep="\t", na.strings=".")

# Before we start the analysis we define the change in LDL, denoted LDLch.
HERS.data[,"LDLch"] = HERS.data[,"LDL1"] - HERS.data[,"LDL"]

# We also define the centered LDL at baseline (by subtracting the mean value 145 mg/dL), denoted cLDL.
HERS.data[,"cLDL"] = HERS.data[,"LDL"] - 145

# a)
# Fit a linear model with the change in LDL as the response and hormone therapy (HT) and baseline LDL (not centered) as covariates:
lm.obj.1 = lm(LDLch~HT+LDL, data=HERS.data)
summary(lm.obj.1)
# Give an interpretation of the estimates in the fitted model.
```

### b)

Fit a model with HT and centered LDL at baseline as covariates:

```{r}
lm.obj.2 = lm(LDLch~HT+cLDL, data=HERS.data)
summary(lm.obj.2)
```

Compare the estimates for this model with those in question a). What is the interpretation of the estimate for the intercept? It is the expected value of the response (LDLch) when both covariates (HT and cLDL) are 0. Note that, because we have re-centered the LDL concentration, cLDL=0 means that LDL=145.
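As a quick numerical check, we can put the coefficient vectors of the two fits side by side (a sketch using `lm.obj.1` and `lm.obj.2` from above): centering should change only the intercept, which shifts by 145 times the LDL slope, while the HT and LDL/cLDL slopes stay the same.

```{r}
# Sketch: centering baseline LDL should leave the HT and LDL slopes unchanged
# and shift only the intercept (by 145 times the LDL slope).
cbind(uncentered = coef(lm.obj.1), centered = coef(lm.obj.2))
coef(lm.obj.1)[1] + 145*coef(lm.obj.1)[3]  # should match the centered intercept
```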
```{r}
# The package 'scales' provides the function 'alpha', which controls the transparency of colours in R.
# It is a really useful trick for crowded scatterplots!
par(mfrow=c(1,2))
# HT==0
plot(LDLch~cLDL, data=HERS.data[HERS.data$HT==0, ], main="Control", cex=.6,
     col=scales::alpha("darkgrey", .5), xlim=c(-100, 200), ylim=c(-200, 150))
abline(a=lm.obj.2$coefficients[1], b=lm.obj.2$coefficients[3], col="green3", lwd=2)
# HT==1
plot(LDLch~cLDL, data=HERS.data[HERS.data$HT==1, ], main="Treatment", cex=.6,
     col=scales::alpha("darkgrey", .5), xlim=c(-100, 200), ylim=c(-200, 150))
abline(a=lm.obj.2$coefficients[1]+lm.obj.2$coefficients[2], b=lm.obj.2$coefficients[3], col="darkorange", lwd=2)
par(mfrow=c(1,1))
```

### c)

Now take the interaction into account:

```{r}
lm.obj.3 = lm(LDLch~HT+cLDL+I(HT*cLDL), data=HERS.data)
summary(lm.obj.3)
```

Is there a significant interaction? Give an interpretation of the estimates in the model. For a complete explanation, elaborate from the example on slide 21 of Lecture 4 for one binary and one numerical covariate. In general, the model assumes that, in addition to the effect of HT on the intercept captured by $\beta_1$, the interaction term lets the slope of the numerical covariate (cLDL in this case) depend on the value of HT. In other words, the slope for those with HT=0 differs from the slope for those with HT=1.

In particular, what is the effect of baseline LDL for those with no hormone therapy? And what is the effect of baseline LDL for those on hormone therapy? This all follows from the interpretation of the different betas. If we set HT=0, the interaction term disappears and we are left with the main slope $\beta_2$.
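To make the two group-specific slopes explicit, here is a small sketch reading them off the fitted interaction model (`lm.obj.3` above), using the coefficient positions: intercept, HT, cLDL, interaction.

```{r}
# Sketch: group-specific slopes of cLDL implied by the interaction model.
b = lm.obj.3$coefficients
slope.HT0 = b[3]         # beta_2: slope for the control group (HT = 0)
slope.HT1 = b[3] + b[4]  # beta_2 + beta_3: slope on hormone therapy (HT = 1)
c(HT0 = unname(slope.HT0), HT1 = unname(slope.HT1))
```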
Alternatively, the slope for those on hormone therapy (HT=1) is, according to the model, $\beta_2+\beta_3$ (check slide 21).

```{r}
# Let's compare the coefficients.
lm.obj.2$coefficients
lm.obj.3$coefficients

par(mfrow=c(1,2))
# HT==0
plot(LDLch~cLDL, data=HERS.data[HERS.data$HT==0, ], main="Control", cex=.6,
     col=scales::alpha("darkgrey", .5), xlim=c(-100, 200), ylim=c(-200, 150))
abline(a=lm.obj.3$coefficients[1], b=lm.obj.3$coefficients[3], col="green3", lwd=2)
# HT==1
plot(LDLch~cLDL, data=HERS.data[HERS.data$HT==1, ], main="Treatment", cex=.6,
     col=scales::alpha("darkgrey", .5), xlim=c(-100, 200), ylim=c(-200, 150))
abline(a=lm.obj.3$coefficients[1]+lm.obj.3$coefficients[2],
       b=lm.obj.3$coefficients[3]+lm.obj.3$coefficients[4], col="darkorange", lwd=2)
par(mfrow=c(1,1))
```

Note the different slopes.

```{r}
#### Bonus: confidence vs. prediction intervals ####

# We start by fitting a simple model.
lmSimple = lm(LDLch~cLDL, data=HERS.data)
summary(lmSimple)

# Now we prepare data for "prediction".
newdata = data.frame(cLDL=seq(min(HERS.data$cLDL, na.rm=T)-1, max(HERS.data$cLDL, na.rm=T)+1, length.out=1000))
confInts = predict(lmSimple, newdata=newdata, interval="confidence")
predInts = predict(lmSimple, newdata=newdata, interval="prediction")

# Let's plot both intervals and see what happens.
plot(HERS.data$cLDL, HERS.data$LDLch, cex=.8, pch=16, col=scales::alpha("darkgrey", .5), xlab="cLDL", ylab="LDLch")
abline(a=lmSimple$coefficients[1], b=lmSimple$coefficients[2], col="red", lwd=2)
lines(newdata$cLDL, confInts[,"lwr"], lty=3, col="blue", lwd=1.5)
lines(newdata$cLDL, confInts[,"upr"], lty=3, col="blue", lwd=1.5)
lines(newdata$cLDL, predInts[,"lwr"], lty=3, col="darkorange", lwd=1.5)
lines(newdata$cLDL, predInts[,"upr"], lty=3, col="darkorange", lwd=1.5)
```

## Exercise 12

```{r}
# Clean up the memory before we start.
rm(list=ls(all=TRUE))

# Read the data.
gun.data = read.table(file="http://www.uio.no/studier/emner/matnat/math/STK4900/v17/gun.dat", col.names=c("method","phys","team","rounds"))

# Take a look at the data.
gun.data
```

#### a)

```{r}
# Compute correlations.
cor(gun.data)
```

What are the correlations between the covariates ("method", "phys", "team")? They are all 0. Can you explain the reason for this? Every combination of levels occurs equally often, i.e. the design is balanced:

```{r}
print(table(gun.data[gun.data$method==1,c("phys","team")]))
print(table(gun.data[gun.data$method==2,c("phys","team")]))

# Just to show what happens if we remove one of the observations: the design is
# no longer balanced, and we do get some correlation among the "factor" variables.
print(cor(gun.data[2:nrow(gun.data),]))
```

```{r}
plot(gun.data)
table(gun.data$method)
table(gun.data$phys)
table(gun.data$team)
```

This plot also illustrates the correlations we computed above.

```{r}
# b), c), d)
# Define the covariates as factors (categorical covariates).
gun.data[,"method"] = factor(gun.data[,"method"])
gun.data[,"phys"] = factor(gun.data[,"phys"])
gun.data[,"team"] = factor(gun.data[,"team"])

# Fit a model with main effects and interactions and write the anova table:
lm.obj = lm(rounds~method*phys*team, data=gun.data)
summary(lm.obj)
anova(lm.obj)
# What does the anova table tell you? (See slide 33, Lecture 5.)
# Which interactions and main effects are significant?
```

The anova table tells us that adding team as a whole does not improve the model fit significantly. But in the lm summary we see that the level team2 seems relevant. We may explore that:

```{r}
gun.data$team2=(gun.data$team==2)
lm.obj = lm(rounds~method*phys*team2, data=gun.data)
summary(lm.obj)
anova(lm.obj)
```

Now the model also finds the phys variable and the phys:team2 interaction relevant. The adjusted R-squared improves because we keep almost the same goodness of fit while reducing the effective number of parameters.
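To verify that claim directly, a small sketch that refits both models and compares their adjusted R-squared values side by side (variable names as defined in the chunks above):

```{r}
# Sketch: adjusted R-squared of the full three-factor model vs. the model
# using the team2 indicator instead of the full team factor.
fit.team  = lm(rounds ~ method*phys*team,  data = gun.data)
fit.team2 = lm(rounds ~ method*phys*team2, data = gun.data)
c(team = summary(fit.team)$adj.r.squared, team2 = summary(fit.team2)$adj.r.squared)
```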
Just so that we understand anova better in this context:

```{r}
lm.obj = lm(rounds~method+phys+team, data=gun.data)
anova(lm.obj)
lm.obj2 = lm(rounds~method+team, data=gun.data)
anova(lm.obj, lm.obj2)
```

## Exercise 12.b

For the gun data in R-exercise 12 we may be interested in predicting the number of rounds for specified levels of method, physique and team, and in finding confidence intervals for the estimated expected values as well as prediction intervals for new observations at these levels of the factors.

```{r}
# a) Set up a data frame for the values where you would like to make predictions, e.g.
gun.test=data.frame(method=factor(c(1,2,1,2)), phys=factor(c(1,1,2,3)), team=factor(c(1,2,3,1)))

# Then find fitted/predicted values for your favourite model from R-exercise 12 (here lm.obj) by
predict(lm.obj, newdata=gun.test)

# b) Then obtain confidence intervals for the expected values at these levels of the factors by
predict(lm.obj, newdata=gun.test, interval="confidence")

# c) Next find the corresponding prediction intervals by
predict(lm.obj, newdata=gun.test, interval="prediction")
```

Compare and discuss the difference between the confidence and prediction intervals. Check slides 34 to 37.
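To make that comparison concrete, one can tabulate the widths of both intervals (a sketch using `lm.obj` and `gun.test` from the chunks above):

```{r}
# Sketch: width of each interval at the chosen factor combinations.
confInts = predict(lm.obj, newdata=gun.test, interval="confidence")
predInts = predict(lm.obj, newdata=gun.test, interval="prediction")
cbind(gun.test,
      ci.width = confInts[,"upr"] - confInts[,"lwr"],
      pi.width = predInts[,"upr"] - predInts[,"lwr"])
# The prediction intervals are wider: they add the residual variance of a new
# observation to the uncertainty of the estimated mean.
```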