---
title: "Day 3 exercises"
author: "Alfonso Diz-Lois"
date: "2025-02-13"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


## Exercise 5.d:  Revisited

```{r}
# Clean up the memory before we start.
rm(list=ls(all=TRUE))

# Read the data into R and make a plot of the data.

# Enter the data.
pef.data = data.frame(pef = c(494,395,516,434,476,413,442,433),
                      minipef = c(512,430,520,428,500,364,380,445))

plot(pef.data)

# Fit a linear model.
lm.obj = lm(minipef~pef, data=pef.data)
# Print the result.
summary(lm.obj)

```

Now let's estimate each of the terms manually:

residual standard error:

```{r}
deg.fr=nrow(pef.data)-1-1 #Degrees of freedom = n-p-1
sqrt(sum((lm.obj$residuals-mean(lm.obj$residuals))^2/deg.fr)) 

```

Multiple R-squared:

```{r}
#In this simple regression setting this is
rsquared=cor(pef.data$pef,pef.data$minipef)^2
#But more generally:
#1-(RSS/TSS)
rsquared
```

Adjusted R-squared:

$$
\hat{R}^2=1-\frac{(1-R^2)(n-1)}{n-p-1}
$$

```{r}
#1 - (1 - R-squared) * (n - 1) / (n - p - 1)
print(1-(1-rsquared)*(nrow(pef.data)-1)/(nrow(pef.data)-1-1))

```


Fstatistic

```{r}


MSS=sum((predict(lm.obj,newdata=pef.data)-mean(pef.data$minipef))^2)
TSS=sum((pef.data$minipef-mean(pef.data$minipef))^2)
RSS=sum((lm.obj$residuals)^2)
p=1
n=nrow(pef.data)
Fstat=(MSS/p)/(RSS/(n-p-1))
Fstat
```

pvalue

```{r}
1-pf(Fstat,p,n-p-1)
```


## Exercise 8:  Coffee Sales

### a
```{r}

# Clean up the memory before we start.
rm(list=ls(all=TRUE))

# Read the data into a dataframe and give names to the variables.
coffee.data = read.table("http://www.uio.no/studier/emner/matnat/math/STK4900/data/exer3_1.dat", header=FALSE)
names(coffee.data)=c("n.dispenser","sale")

# Take a look at the data.
coffee.data


# Create a scatter plot.
plot(x=coffee.data[,"n.dispenser"],
     y=coffee.data[,"sale"],
     xlab="Number of dispensers",
     ylab="Coffee sales")


```


```{r}
# Inspect the plot. How is the relation between the number of dispensers and the coffee sale?

# Fit a linear regression.
lm.obj = lm(sale~n.dispenser, data=coffee.data)
summary(lm.obj)

# Fit a second order polynomial.
poly.lm.obj = lm(sale~n.dispenser+I(n.dispenser^2), data=coffee.data)
summary(poly.lm.obj)


# Visualize two models.

# Create a scatter plot.
plot(x=coffee.data[,"n.dispenser"],
     y=coffee.data[,"sale"],
     xlab="Number of dispensers",
     ylab="Coffee sales",
     main="")
# Plot the fitted linear model.
abline(lm.obj, col="blue")
# Notice that abline gives a warning now.
#abline(poly.lm.obj, col="red")
lines(x=coffee.data$n.dispenser,y=predict(poly.lm.obj,newdata=coffee.data),col="red")
```

## Exercise 9:  Insurance


```{r}
# Clean up the memory before we start.
rm(list=ls(all=TRUE))

# Read the data into a dataframe and give names to the variables.
insurance.data = read.table("http://www.uio.no/studier/emner/matnat/math/STK4900/data/exer3_2.dat", header=FALSE)
names(insurance.data)=c("income","risk.avs","amount")
sum(is.na(insurance.data))
```


```{r}
# Take a look at the data.
insurance.data
summary(insurance.data)
# Make sure that you understand what the summary measures tell you!
```

```{r}

# Create scatter plots between 'amount' and other variables.

# 1 by 2 grid of plots.
par(mfrow=c(1,2))
# Scatter plot: 'amount' vs 'income'
plot(x=insurance.data[,"income"],
     y=insurance.data[,"amount"],
     xlab="Annual income",
     ylab="Amount of life insurance",
     main = "")
# Scatter plot: 'amount' vs 'risk.avs'
plot(x=insurance.data[,"risk.avs"],
     y=insurance.data[,"amount"],
     xlab="Risk aversion score",
     ylab="Amount of life insurance",
     main = "")


```


```{r}

par(mfrow=c(1,1))

# Alternative: scatter plot matrix
plot(insurance.data)

# What do the plots tell you?

# Compute the correlation between the variables.
cor(insurance.data)
# Do the correlations agree with what you saw from the plots?
```
Carry out univariate regression analysis of 'amount' versus each of the other two variables.

```{r}
lm.obj.1 = lm(amount~income, data=insurance.data)
summary(lm.obj.1)

```


```{r}
lm.obj.2 = lm(amount~risk.avs, data=insurance.data)
summary(lm.obj.2)

```

Which of the two variables, income and risk aversion, is most important for explaining the variation in the amount of life insurance carried?

Does any of the variables (alone) have a significant effect?

Carry out a regression analysis including both income and risk aversion:

```{r}

lm.obj.3 = lm(amount~income+risk.avs, data=insurance.data)
summary(lm.obj.3)

```

What does this model tell you? Does it look better than the best of the two models with only one covariate?

Try yourself models with second order terms for income and/or risk aversion.


```{r}
lm.obj.1.poly = lm(amount~income+I(income^2), data=insurance.data)
summary(lm.obj.1.poly)
```

```{r}
lm.obj.1.poly.b = lm(amount~I(income^2), data=insurance.data)
summary(lm.obj.1.poly.b)

```

Why is this?

Note that (from slide 13):

$$

V\hat{a}r(\hat{\beta_j})=\frac{s_{y|x}^2}{(n-1)s_x^2(1-\bf{r_j^2})}
$$
And in this case....

```{r}

cor(insurance.data$income,insurance.data$income^2)

```

Similarly:

```{r}

lm.obj.2.poly = lm(amount~risk.avs+I(risk.avs^2), data=insurance.data)
summary(lm.obj.2.poly)

```

```{r}
lm.obj.2.poly.b = lm(amount~I(risk.avs^2), data=insurance.data)
summary(lm.obj.2.poly.b)
```


```{r}

x.grid = seq(min(insurance.data$risk.avs)-1, max(insurance.data$risk.avs)+1, length.out=1000)
y.predicted = predict.lm(lm.obj.2.poly, newdata=data.frame(risk.avs=x.grid))

par(mfrow=c(1,2))
# Scatter plot: 'amount' vs 'income'
plot(x=insurance.data[,"income"],
     y=insurance.data[,"amount"],
     xlab="Annual income",
     ylab="Amount of life insurance",
     main = "")
abline(lm.obj.1, col="red")
# Scatter plot: 'amount' vs 'risk.avs'
plot(x=insurance.data[,"risk.avs"],
     y=insurance.data[,"amount"],
     xlab="Risk aversion score",
     ylab="Amount of life insurance",
     main = "")
lines(x=x.grid,
      y=y.predicted,
      col="blue")
par(mfrow=c(1,1))


```


##Exercise 10:  multiple linear regression


```{r}
# Clean up the memory before we start.
rm(list=ls(all=TRUE))


# The Heart and Estrogen/Progestin Study (HERS) is a clinical trial of hormone therapy for prevention of recurrent
# heart attacks and death among post-menopausal women with existing coronary heart disease.
# The HERS data are used in many of the examples in Chapters 3 and 4 of the course text book byVittinghoff et al.
# In this exercise we will study how different variables may influence the glucose level in the blood for the non-diabetic women in the cohort,
# in particular we are interested to see if exercise may help to reduce the glucose level (cf. Section 4.1 in Vittinghoff et al.).

# Read the data.
HERS.data = read.table("http://www.uio.no/studier/emner/matnat/math/STK4900/data/hers.txt", header=TRUE, sep="\t", na.strings=".")
# Keep only women without diabetes.
HERS.data.1 = HERS.data[(HERS.data[,"diabetes"] == 0), ]
rownames(HERS.data.1) = NULL
```

We will start out by investigating (in questions a-c) how the glucose levels are for women who exercise at least three times a week (coded as exercise=1) and women who exercise less than three times a week (coded as exercise=0).


### a Make a summary and boxplot of the glucose levels according to the level of exercise:

```{r}

# Create summary
summary(HERS.data.1[(HERS.data[,"exercise"] == 0), "glucose"])
summary(HERS.data.1[(HERS.data[,"exercise"] == 1), "glucose"])
table(HERS.data[,"exercise"])
# Create a boxplot.
# Draw boxplot.
boxplot(glucose~exercise, data=HERS.data.1, names=c("No exercise","Exercise"))
```

Discuss what the summaries and boxplot tell you.


```{r}

# Another way to explore the data might be using histograms to inspect distributions.
par(mfrow=c(2,1))
hist(HERS.data.1[HERS.data.1$exercise==0, "glucose"], breaks=40, xlim=c(40, 140), main="No exercise", xlab="glucose")
hist(HERS.data.1[HERS.data.1$exercise==1, "glucose"], breaks=40, xlim=c(40, 140), main="Exercise", xlab="glucose")
par(mfrow=c(1,1))

```


###b Test whether there is a difference in glucose level and make a confidence interval:


```{r}
# 
t.test(glucose~exercise, var.equal=TRUE, data=HERS.data.1)
# What may you conclude for the test and the confidence interval?

```


### c Perform a simple linear regression with glucose level as outcome and exercise as predictor:

```{r}

lm.obj = lm(glucose~exercise, data=HERS.data.1)
summary(lm.obj)
# Discuss how the result of the simple linear regression relates to those in question b).

```

### d The women who exercise at least three times a week and the women who exercise less than three times a week may differ in many ways. For example they may be younger and have a lower BMI (body mass index). We will therefore perform a multiple linear regression analysis where we adjust for these to variables.

```{r}

# Perform a multiple linear regression with glucose level as outcome and exercise, age, and BMI as predictors.
lm.obj.2 = lm(glucose~exercise+age+BMI, data=HERS.data.1)
summary(lm.obj.2)
# Discuss the result of this analysis in relation with the result in question c). Also discuss how age and BMI influence the glucose level.

```