--- title: "Exercise 4" output: html_document: default pdf_document: default date: "2024-02-05" --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, results = "hold" ) ``` # Cross-validation #### (a) Simulate data ```{r} # simulate data set.seed(1) y <- rnorm(100) x <- rnorm(100) y <- x - 2 * x^2 + rnorm(100) # scatter plot plot(x, y) ``` We simulate from the following model: $$ y_i =\beta_0 + \beta_1*x_i + \beta_2*x_i^2 + \epsilon_i, \quad \epsilon_i \sim N(0, 1) \\ \beta_0 = 0, \beta_1 = 1, \beta_2 = -2, $$ #### (c)-(d) Run LOOCV Here, I have written a function that performs k-fold CV for the linear regression case. It poperates on a data frame and use the stats::lm functionality: ```{r} # function that computes the CV-error of LS-estimator: # - k: (integer) number of folds # - df: (data.frame) training data, includes both outcome and predictors # - yvar: (character) name of outcome variable in df # - formula: (formula) formula argument to stats::lm. Defaults to "yvar ~." cv_using_lm <- function(k, df, yvar, formula = as.formula(paste0(yvar, "~."))) { # compute fold-size N <- nrow(df) # assign the observations randomly k test sets fold <- sample(rep(seq_len(k), length.out = N)) #table(fold) # init vector for storing errors err <- numeric(k) for (kk in seq_len(k)) { # indicator for the test-observations in current fold inFold <- fold == kk # fit model to current train-set fit <- lm(formula, data = df[!inFold, ]) # predict current test-set outcomes yhat <- predict(fit, newdata = df[inFold, ]) # compute loss y <- df[[yvar]][inFold] err[kk] <- sum((y-yhat)**2)/length(y) } return(err) } # for each model, store the outcome and predictors in a data frame dfs <- list(data.frame(y = y, x)) dfs[[2]] <- cbind(dfs[[1]], x**2) dfs[[3]] <- cbind(dfs[[2]], x**3) dfs[[4]] <- cbind(dfs[[3]], x**4) # estimate prediction error with N-fold CV n <- length(y) set.seed(007) sapply(dfs, function(df) mean(cv_using_lm(n, df, "y"))) # estimate prediction error with N-fold CV set.seed(007+1) sapply(dfs, function(df) mean(cv_using_lm(n, df, "y"))) ``` There is only one way to split the data into N folds of size 1, therefore the error estimates are exactly the same. #### (e) Which model has the lowest error? The model fitting a second order poly has the lowest error. Since we have simulated the data, we know that this is the true model and expect therefore this model to be a good fit to the test-set in each fold. The training error, on the other hand, will always decrease by including additional regressors. #### (f) Compare significance of coefficient estimates ```{r} # compute p-vals for the coefficients and store them in a matrix p_vals <- matrix(nrow = 5, ncol = length(dfs)) for (i in seq_along(dfs)) { fit <- lm(y~., data = dfs[[i]]) coeffs <- coefficients(fit) sds <- sqrt(diag(vcov(fit))) t_vals <- coeffs/sds p_vals[seq_len(i+1), i] <- 2*(1-pt(abs(t_vals), n-length(coeffs))) } rownames(p_vals) <- names(coeffs) colnames(p_vals) <- paste0("Model", seq_along(dfs)) round(p_vals, 3) ``` We see that the coefficient on the first and second order term is significant, except from the first model. The intercept is significant only in this first model. The higher order terms is not significant. From the scatterplot in (b) wee see that there is a clear non-linear relationship between y and x. No straight line gives a good fit to these data. Therefore we wrongly conclude in model 1 that the intercept is different from 0 and the slope is not. # Ex 7.9 (Hastie 2009, p. 
# Ex 7.9 (Hastie 2009, p. 259)

The exercise reads: "Ex. 7.9 For the prostate data of Chapter 3, carry out a ***best-subset*** linear regression analysis, as in Table 3.3 (third column from left). Compute the AIC, BIC, five- and tenfold cross-validation, and bootstrap .632 estimates of prediction error. Discuss the results."

The data can be downloaded from the book website, https://hastie.su.domains/ElemStatLearn/:

- columns 1-8 are predictors
- column 9 is the outcome (lpsa)
- column 10 divides the data into a test and a train set

#### Import data

```{r}
# I let the -here- package help me with relative paths
# this command defines the root directory
here::i_am("exercises/exercises_4_sol.Rmd")

# read data from local file
data <- read.csv(here::here("./data/prostate.txt"), sep = "\t",
                 header = TRUE, stringsAsFactors = TRUE)
data[1] <- NULL # remove first column (row numbers)
head(data)
```

As described in ch. 3.2.1, the predictors are standardized (centered and scaled to unit variance) and the data are divided into a training and a test set.

```{r}
# construct data.frame containing the scaled predictors and the outcome
df <- data.frame(lapply(data[, 1:8], function(x) (x - mean(x)) / sd(x)),
                 lpsa = data$lpsa)
sapply(df, var) # check: each scaled predictor has variance 1

# split into training and test set using the given indicator
dfs <- list(train = df[data$train == TRUE, ],
            test  = df[data$train == FALSE, ])
```

Now, we can reproduce the LS estimates in Table 3.3:

```{r}
# fit linear regression model
fit <- lm(lpsa ~ ., data = dfs$train)
summary(fit)
```

### Best subset selection

We want to find the best subset of predictors to include in a linear regression model, using different criteria (AIC, BIC, CV, .632 bootstrap).

We can re-use the CV function we defined in problem 1 to apply 5- and 10-fold CV. To apply the .632 bootstrap estimator, we need a similar function, for example:

```{r}
# function that computes the .632 bootstrap error of the LS-estimator:
# - B:       (integer) number of bootstrap samples
# - df:      (data.frame) training data, includes both outcome and predictors
# - yvar:    (character) name of outcome variable in df
# - formula: (formula) formula argument to stats::lm. Defaults to "yvar ~ ."
bootstrap_632_err <- function(B, df, yvar,
                              formula = as.formula(paste0(yvar, "~."))) {
  N <- nrow(df)

  # helper that computes the test error of a fitted model
  computeError <- function(model, newdata)
    mean((newdata[, yvar] - predict(model, newdata = newdata))^2)

  # compute bootstrap error
  bootstrap.err.b <- matrix(NA, nrow = B, ncol = 1)
  for (b in 1:B) {
    set.seed(b) # set seed, useful for replication
    index <- sample(N, replace = TRUE)
    # the selected observations form the training set
    temp.train.data <- df[index, ]
    # the rest form the test set
    temp.test.data <- df[-index, ]

    # fit model on the temporary train set
    fit <- lm(formula, data = temp.train.data)

    # save the error in the matrix initialized above
    bootstrap.err.b[b] <- computeError(fit, temp.test.data)
    # if (b %% 100 == 0) print(b)
  }

  # compute training error
  fit <- lm(formula, data = df)
  err.train <- computeError(fit, df)

  # return the 0.632 bootstrap error
  return(0.368 * err.train + 0.632 * mean(bootstrap.err.b))
}

# test the function on the full training data set
bootstrap_632_err(1000, dfs$train, "lpsa")
```

Now, we can loop through every possible subset of the predictors and estimate the prediction error with the different methods.

*Note*: There are probably more clever ways to do this, but this was the first that came to my mind during the unexpected live-coding session in class.
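Before the manual loop below, a side note: the exhaustive best-subset search is also available off the shelf in the `leaps` package, whose summary reports Cp and BIC for the best model of each subset size. This sketch is not part of the original solution; it assumes `leaps` is installed and is therefore not evaluated here (`eval = FALSE`).

```{r, eval=FALSE}
# alternative to the manual loop below: exhaustive best-subset search with leaps
library(leaps)
subsets <- regsubsets(lpsa ~ ., data = dfs$train, nvmax = 8)
subset_summary <- summary(subsets)
# predictors included in the subset with the lowest BIC
subset_summary$which[which.min(subset_summary$bic), ]
```

Note that the BIC values reported by `regsubsets` may differ from the hand-computed ones below by an additive constant, so only differences between models should be compared.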
```{r}
set.seed(007)
p <- 8
N <- nrow(dfs$train)
err <- list()

# for each subset size m ...
for (m in seq_len(p)) {

  # list all subsets of size m of the p variables
  all_comb <- combn(p, m)

  # initialize a matrix for storing the errors
  tmp <- matrix(nrow = 5, ncol = ncol(all_comb))
  rownames(tmp) <- c("aic", "bic", "cv5", "cv10", "bootstrap632")
  err[[m]] <- tmp

  # for each subset of size m ...
  for (i in seq_len(ncol(all_comb))) {

    # construct a data.frame containing the outcome and the current predictors
    vars <- all_comb[, i]
    temp_df <- data.frame(lpsa = dfs$train$lpsa, dfs$train[, vars])

    # fit the model
    fit <- lm(lpsa ~ ., data = temp_df)
    # number of estimated params: m slopes + intercept + residual variance
    d <- m + 2

    # evaluate the log-likelihood
    ll <- as.numeric(logLik(fit))

    # aic
    r <- 1
    err[[m]][r, i] <- 2 * d / N - 2 / N * ll
    # bic
    r <- r + 1
    err[[m]][r, i] <- d * log(N) - 2 * ll
    # cv 5-fold
    r <- r + 1
    err[[m]][r, i] <- mean(cv_using_lm(5, temp_df, "lpsa"))
    # cv 10-fold
    r <- r + 1
    err[[m]][r, i] <- mean(cv_using_lm(10, temp_df, "lpsa"))
    # bootstrap .632
    # (note that I set B = 100 to reduce computation time)
    r <- r + 1
    err[[m]][r, i] <- bootstrap_632_err(B = 100, df = temp_df, yvar = "lpsa")
  }
}
```

Now, we want to find which subset is best according to each criterion. Here, I do so by collecting the estimates in a data frame, so that the dplyr package can be used to find the minimum over all subsets.

```{r}
library(dplyr)

df <- setNames(reshape2::melt(err), c("crit", "comb", "value", "size")) %>%
  select(crit, size, comb, value)
head(df)
```

Then, I group the data by criterion and find the minimum error.

```{r}
res <- df %>%
  group_by(crit) %>%
  arrange(crit) %>%
  filter(value == min(value))
head(res)
```

We see that different subsets (identified by the size and comb variables) achieve the lowest error, depending on the criterion used.

Finally, we can re-fit each of the "best subset" models chosen by the criteria on the full training data set.

```{r}
# fit the full linear regression model
fit <- lm(lpsa ~ ., dfs$train)
beta <- coefficients(fit)

# init matrix for storing the coefficient estimates of each subset
coeffs <- matrix(nrow = length(beta), ncol = nrow(res) + 1)
coeffs[, 1] <- beta
colnames(coeffs) <- c("LS", levels(res$crit)[res$crit])
rownames(coeffs) <- names(beta)

# fit models to each of the chosen subsets
for (i in 1:nrow(res)) {
  m <- res$size[i]
  comb <- res$comb[i]

  # find the variables included in the subset
  vars <- combn(p, m)[, comb]

  # construct a data frame with this subset of variables
  temp_df <- data.frame(lpsa = dfs$train$lpsa, dfs$train[, vars])

  # fit the model and collect the coefficient estimates
  fit <- lm(lpsa ~ ., data = temp_df)
  coeffs[c(1, vars + 1), i + 1] <- coefficients(fit)
}
# coeffs
```

### Extra: *Compare estimates of the prediction error*

I first misread the exercise and ignored the "best-subset" part (sorry). Whereas above we used the prediction error estimates for model selection (to find the best predictor subset), we could also be interested in estimating the prediction error of a fixed model. In the following, I apply the different procedures to estimate the prediction error of the full regression model and compare these estimates to the test error obtained on the given test set.
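Since `fit` currently holds the model for the last subset refitted in the loop above, we first refit the full model on the training data; the chunks below evaluate this fit.

```{r}
# refit the full model on the training set (fit was overwritten in the loop above)
fit <- lm(lpsa ~ ., data = dfs$train)
```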
#### Test and train:

```{r}
# initialize list for storing the error estimates
err <- list()

# compute error in the training set
y <- dfs$train$lpsa # observed training outcomes
yhat <- fitted(fit) # fitted values
err$train <- mean((y - yhat)^2)

# compute error in the test set
y <- dfs$test$lpsa
yhat <- predict(fit, newdata = dfs$test)
err$test <- mean((y - yhat)^2)
```

#### AIC and BIC:

We estimate the AIC and BIC by computing the log-likelihood of the fitted model:

```{r}
N <- nrow(dfs$train) # number of training samples
d <- 8 + 2 # number of estimated params (incl. intercept and residual variance)

# compute the log-likelihood of the fitted model
logL <- as.numeric(logLik(fit))
cat("logLik:", logL)

# compare with a "by-hand" computation
y <- dfs$train$lpsa # observed training outcomes
yhat <- fitted(fit) # fitted values
s2 <- mean((y - yhat)^2) # MLE of the residual variance (biased estimator)
cat("by-hand:", sum(log(dnorm(y, mean = yhat, sd = sqrt(s2)))))

err$aic <- 2 * d / N - 2 / N * logL
err$bic <- d * log(N) - 2 * logL
```

#### Cross-validation:

Use the function from problem 1 to apply CV:

```{r}
yvar <- "lpsa"
err$cv5 <- mean(cv_using_lm(5, dfs$train, yvar))
err$cv10 <- mean(cv_using_lm(10, dfs$train, yvar))
```

#### Bootstrap .632:

The .632 bootstrap estimator estimates the prediction error as a weighted average of the training error and the leave-one-out bootstrap estimator. Here, I have re-used the code in `r_code_lecture_6.r` to compute the bootstrap error:

```{r}
B <- 1000
bootstrap.err.b <- matrix(NA, nrow = B, ncol = 1)

# helper that computes the test error of a fitted model
computeError <- function(model, newdata)
  mean((newdata[, "lpsa"] - predict(model, newdata = newdata))^2)

for (b in 1:B) {
  set.seed(b) # set seed, useful for replication
  index <- sample(N, replace = TRUE)
  # the selected observations form the training set
  temp.train.data <- dfs$train[index, ]
  # the rest form the test set
  temp.test.data <- dfs$train[-index, ]

  fit <- lm(lpsa ~ ., data = temp.train.data)

  # save the error in the matrix initialized above
  bootstrap.err.b[b] <- computeError(fit, temp.test.data)
  if (b %% 100 == 0) print(b)
}

# compute the average error
err$bootstrap <- mean(bootstrap.err.b)

# 0.632 bootstrap
err$bootstrap_632 <- 0.368 * err$train + 0.632 * err$bootstrap
```

#### Compare the error estimates:

```{r}
err
```

The error estimates differ considerably, partly because they target different quantities. Normally, we are interested in the test error, that is, the expected prediction error ***conditional*** on the particular training data we have ($Err_\mathcal{T}$ in the notation of Hastie et al.). Both the cross-validation and the bootstrap estimates are typically closer to the **unconditional** error, the expected test error $Err = \mathrm{E}[Err_\mathcal{T}]$; see ch. 7.12 in the ELS book for a discussion. The AIC criterion again estimates another quantity, namely the **in-sample** error ($Err_{in}$), while the BIC criterion has a Bayesian motivation: finding the most likely model given the data. In other words, a direct comparison of these numbers is difficult.
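A rough way to make the AIC-type estimate comparable in scale is the Cp-style estimate of the in-sample error, $\overline{\mathrm{err}} + 2\,(d/N)\,\hat\sigma_\varepsilon^2$ (cf. ELS ch. 7.5), which lives on the same squared-error scale as the training, CV, and bootstrap numbers. Here $d$ counts only the mean parameters (intercept plus slopes), unlike the $d$ used for AIC/BIC above. The sketch below is not part of the original solution; it refits the full model because `fit` was overwritten inside the bootstrap loop.

```{r}
# Cp-style estimate of the in-sample error, on the MSE scale:
#   err_train + 2 * d/N * sigma2_hat
fit_full <- lm(lpsa ~ ., data = dfs$train) # refit (fit was overwritten above)
d_mean <- length(coefficients(fit_full)) # intercept + 8 slopes
sigma2_hat <- sum(residuals(fit_full)^2) / (N - d_mean) # unbiased residual variance
err$cp <- err$train + 2 * d_mean / N * sigma2_hat
err$cp
```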