--- title: "Exercise 4" output: html_document: default pdf_document: default date: "2024-02-05" --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, results = "hold" ) ``` # Cross-validation #### (a) Simulate data ```{r} # simulate data set.seed(1) y <- rnorm(100) x <- rnorm(100) y <- x - 2 * x^2 + rnorm(100) # scatter plot plot(x, y) ``` We simulate from the following model: $$ y_i =\beta_0 + \beta_1*x_i + \beta_2*x_i^2 + \epsilon_i, \quad \epsilon_i \sim N(0, 1) \\ \beta_0 = 0, \beta_1 = 1, \beta_2 = -2, $$ #### (c)-(d) Run LOOCV Here, I have written a function that performs k-fold CV for the linear regression case. It poperates on a data frame and use the stats::lm functionality: ```{r} # function that computes the CV-error of LS-estimator: # - k: (integer) number of folds # - df: (data.frame) training data, includes both outcome and predictors # - yvar: (character) name of outcome variable in df # - formula: (formula) formula argument to stats::lm. Defaults to "yvar ~." cv_using_lm <- function(k, df, yvar, formula = as.formula(paste0(yvar, "~."))) { # compute fold-size N <- nrow(df) # assign the observations randomly k test sets fold <- sample(rep(seq_len(k), length.out = N)) #table(fold) # init vector for storing errors err <- numeric(k) for (kk in seq_len(k)) { # indicator for the test-observations in current fold inFold <- fold == kk # fit model to current train-set fit <- lm(formula, data = df[!inFold, ]) # predict current test-set outcomes yhat <- predict(fit, newdata = df[inFold, ]) # compute loss y <- df[[yvar]][inFold] err[kk] <- sum((y-yhat)**2)/length(y) } return(err) } # for each model, store the outcome and predictors in a data frame dfs <- list(data.frame(y = y, x)) dfs[[2]] <- cbind(dfs[[1]], x**2) dfs[[3]] <- cbind(dfs[[2]], x**3) dfs[[4]] <- cbind(dfs[[3]], x**4) # estimate prediction error with N-fold CV n <- length(y) set.seed(007) sapply(dfs, function(df) mean(cv_using_lm(n, df, "y"))) # estimate prediction error with N-fold CV set.seed(007+1) sapply(dfs, function(df) mean(cv_using_lm(n, df, "y"))) ``` There is only one way to split the data into N folds of size 1, therefore the error estimates are exactly the same. #### (e) Which model has the lowest error? The model fitting a second order poly has the lowest error. Since we have simulated the data, we know that this is the true model and expect therefore this model to be a good fit to the test-set in each fold. The training error, on the other hand, will always decrease by including additional regressors. #### (f) Compare significance of coefficient estimates ```{r} # compute p-vals for the coefficients and store them in a matrix p_vals <- matrix(nrow = 5, ncol = length(dfs)) for (i in seq_along(dfs)) { fit <- lm(y~., data = dfs[[i]]) coeffs <- coefficients(fit) sds <- sqrt(diag(vcov(fit))) t_vals <- coeffs/sds p_vals[seq_len(i+1), i] <- 2*(1-pt(abs(t_vals), n-length(coeffs))) } rownames(p_vals) <- names(coeffs) colnames(p_vals) <- paste0("Model", seq_along(dfs)) round(p_vals, 3) ``` We see that the coefficient on the first and second order term is significant, except from the first model. The intercept is significant only in this first model. The higher order terms is not significant. From the scatterplot in (b) wee see that there is a clear non-linear relationship between y and x. No straight line gives a good fit to these data. Therefore we wrongly conclude in model 1 that the intercept is different from 0 and the slope is not. # Ex 7.9 (Hastie 2009, p. 
# Ex 7.9 (Hastie 2009, p. 259)

The exercise reads: "Ex. 7.9 For the prostate data of Chapter 3, carry out a ***best-subset*** linear regression analysis, as in Table 3.3 (third column from left). Compute the AIC, BIC, five- and tenfold cross-validation, and bootstrap .632 estimates of prediction error. Discuss the results."

The data can be downloaded from the book website, https://hastie.su.domains/ElemStatLearn/:

- columns 1-8 are predictors
- column 9 is the outcome (lpsa)
- column 10 divides the data into a test and a train set

#### Import data

```{r}
# I let the -here- package help me with relative paths
# this command defines the root directory
here::i_am("exercises/exercises_4_sol.Rmd")

# read data from local file
data <- read.csv(here::here("./data/prostate.txt"), sep = "\t",
                 header = TRUE, stringsAsFactors = TRUE)
data[1] <- NULL # remove first column (row numbers)
head(data)
```

As described in ch. 3.2.1, the predictors are standardized (centered and scaled to unit variance) and the data are divided into a training and a test set.

```{r}
# construct data.frame containing the scaled predictors and the outcome
df <- data.frame(lapply(data[, 1:8], function(x) (x - mean(x)) / sd(x)),
                 lpsa = data$lpsa)
sapply(df, var) # check: each scaled predictor has variance 1

# split into training and test set using the given indicator
dfs <- list(train = df[data$train == TRUE, ],
            test  = df[data$train == FALSE, ])
```

Now, we can reproduce the LS estimates in Table 3.3:

```{r}
# fit linear regression model
fit <- lm(lpsa ~ ., data = dfs$train)
summary(fit)
```

### Best subset selection

We want to find the best subset of predictors to include in a linear regression model, using different criteria (AIC, BIC, CV, .632 bootstrap).

We can re-use the CV function we defined in problem 1 to apply 5- and 10-fold CV. To apply the .632 bootstrap estimator, we need a similar function, for example:

```{r}
# function that computes the .632 bootstrap error of the LS-estimator:
# - B:       (integer) number of bootstrap samples
# - df:      (data.frame) training data, includes both outcome and predictors
# - yvar:    (character) name of outcome variable in df
# - formula: (formula) formula argument to stats::lm. Defaults to "yvar ~ ."
bootstrap_632_err <- function(B, df, yvar,
                              formula = as.formula(paste0(yvar, "~."))) {
  N <- nrow(df)

  # helper that computes the test error of a fitted model
  computeError <- function(model, newdata)
    mean((newdata[, yvar] - predict(model, newdata = newdata))^2)

  # compute bootstrap error
  bootstrap.err.b <- matrix(NA, nrow = B, ncol = 1)
  for (b in 1:B) {
    set.seed(b) # set seed, useful for replication
    index <- sample(N, replace = TRUE)
    # the selected observations form the training set
    temp.train.data <- df[index, ]
    # the rest form the test set
    temp.test.data <- df[-index, ]

    # fit model on the temporary train set
    fit <- lm(formula, data = temp.train.data)

    # save the error in the matrix initialized above
    bootstrap.err.b[b] <- computeError(fit, temp.test.data)
    # if (b %% 100 == 0) print(b)
  }

  # compute training error
  fit <- lm(formula, data = df)
  err.train <- computeError(fit, df)

  # return the 0.632 bootstrap error
  return(0.368 * err.train + 0.632 * mean(bootstrap.err.b))
}

# test the function on the full training data set
bootstrap_632_err(1000, dfs$train, "lpsa")
```

Now, we can loop through every possible subset of the predictors and estimate the prediction error with the different methods.

*Note*: There are probably more clever ways to do this, but this was the first that came to my mind during the unexpected live-coding session in class.
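Before the manual loop below, a side note: the exhaustive best-subset search is also available off the shelf in the `leaps` package, whose summary reports Cp and BIC for the best model of each subset size. This sketch is not part of the original solution; it assumes `leaps` is installed and is therefore not evaluated here (`eval = FALSE`).

```{r, eval=FALSE}
# alternative to the manual loop below: exhaustive best-subset search with leaps
library(leaps)
subsets <- regsubsets(lpsa ~ ., data = dfs$train, nvmax = 8)
subset_summary <- summary(subsets)
# predictors included in the subset with the lowest BIC
subset_summary$which[which.min(subset_summary$bic), ]
```

Note that the BIC values reported by `regsubsets` may differ from the hand-computed ones below by an additive constant, so only differences between models should be compared.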
```{r}
set.seed(007)
p <- 8
N <- nrow(dfs$train)
err <- list()

# for each subset size m ...
for (m in seq_len(p)) {

  # list all subsets of size m of the p variables
  all_comb <- combn(p, m)

  # initialize a matrix for storing the errors
  tmp <- matrix(nrow = 5, ncol = ncol(all_comb))
  rownames(tmp) <- c("aic", "bic", "cv5", "cv10", "bootstrap632")
  err[[m]] <- tmp

  # for each subset of size m ...
  for (i in seq_len(ncol(all_comb))) {

    # construct a data.frame containing the outcome and the current predictors
    vars <- all_comb[, i]
    temp_df <- data.frame(lpsa = dfs$train$lpsa, dfs$train[, vars])

    # fit the model
    fit <- lm(lpsa ~ ., data = temp_df)
    # number of estimated params: m slopes + intercept + residual variance
    d <- m + 2

    # evaluate the log-likelihood
    ll <- as.numeric(logLik(fit))

    # aic
    r <- 1
    err[[m]][r, i] <- 2 * d / N - 2 / N * ll
    # bic
    r <- r + 1
    err[[m]][r, i] <- d * log(N) - 2 * ll
    # cv 5-fold
    r <- r + 1
    err[[m]][r, i] <- mean(cv_using_lm(5, temp_df, "lpsa"))
    # cv 10-fold
    r <- r + 1
    err[[m]][r, i] <- mean(cv_using_lm(10, temp_df, "lpsa"))
    # bootstrap .632
    # (note that I set B = 100 to reduce computation time)
    r <- r + 1
    err[[m]][r, i] <- bootstrap_632_err(B = 100, df = temp_df, yvar = "lpsa")
  }
}
```

Now, we want to find which subset is best according to each criterion. Here, I do so by collecting the estimates in a data frame, so that the dplyr package can be used to find the minimum over all subsets.

```{r}
library(dplyr)

df <- setNames(reshape2::melt(err), c("crit", "comb", "value", "size")) %>%
  select(crit, size, comb, value)
head(df)
```

Then, I group the data by criterion and find the minimum error.

```{r}
res <- df %>%
  group_by(crit) %>%
  arrange(crit) %>%
  filter(value == min(value))
head(res)
```

We see that different subsets (identified by the size and comb variables) achieve the lowest error, depending on the criterion used.

Finally, we can re-fit each of the "best subset" models chosen by the criteria on the full training data set.

```{r}
# fit the full linear regression model
fit <- lm(lpsa ~ ., dfs$train)
beta <- coefficients(fit)

# init matrix for storing the coefficient estimates of each subset
coeffs <- matrix(nrow = length(beta), ncol = nrow(res) + 1)
coeffs[, 1] <- beta
colnames(coeffs) <- c("LS", levels(res$crit)[res$crit])
rownames(coeffs) <- names(beta)

# fit models to each of the chosen subsets
for (i in 1:nrow(res)) {
  m <- res$size[i]
  comb <- res$comb[i]

  # find the variables included in the subset
  vars <- combn(p, m)[, comb]

  # construct a data frame with this subset of variables
  temp_df <- data.frame(lpsa = dfs$train$lpsa, dfs$train[, vars])

  # fit the model and collect the coefficient estimates
  fit <- lm(lpsa ~ ., data = temp_df)
  coeffs[c(1, vars + 1), i + 1] <- coefficients(fit)
}
# coeffs
```

### Extra: *Compare estimates of the prediction error*

I first misread the exercise and ignored the "best-subset" part (sorry). Whereas above we used the prediction error estimates for model selection (to find the best predictor subset), we could also be interested in estimating the prediction error of a fixed model. In the following, I apply the different procedures to estimate the prediction error of the full regression model and compare these estimates to the test error obtained on the given test set.
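Since `fit` currently holds the model for the last subset refitted in the loop above, we first refit the full model on the training data; the chunks below evaluate this fit.

```{r}
# refit the full model on the training set (fit was overwritten in the loop above)
fit <- lm(lpsa ~ ., data = dfs$train)
```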
#### Test and train:

```{r}
# initialize list for storing the error estimates
err <- list()

# compute error in the training set
y <- dfs$train$lpsa # observed training outcomes
yhat <- fitted(fit) # fitted values
err$train <- mean((y - yhat)^2)

# compute error in the test set
y <- dfs$test$lpsa
yhat <- predict(fit, newdata = dfs$test)
err$test <- mean((y - yhat)^2)
```

#### AIC and BIC:

We estimate the AIC and BIC by computing the log-likelihood of the fitted model:

```{r}
N <- nrow(dfs$train) # number of training samples
d <- 8 + 2 # number of estimated params (incl. intercept and residual variance)

# compute the log-likelihood of the fitted model
logL <- as.numeric(logLik(fit))
cat("logLik:", logL)

# compare with a "by-hand" computation
y <- dfs$train$lpsa # observed training outcomes
yhat <- fitted(fit) # fitted values
s2 <- mean((y - yhat)^2) # MLE of the residual variance (biased estimator)
cat("by-hand:", sum(log(dnorm(y, mean = yhat, sd = sqrt(s2)))))

err$aic <- 2 * d / N - 2 / N * logL
err$bic <- d * log(N) - 2 * logL
```

#### Cross-validation:

Use the function from problem 1 to apply CV:

```{r}
yvar <- "lpsa"
err$cv5 <- mean(cv_using_lm(5, dfs$train, yvar))
err$cv10 <- mean(cv_using_lm(10, dfs$train, yvar))
```

#### Bootstrap .632:

The .632 bootstrap estimator estimates the prediction error as a weighted average of the training error and the leave-one-out bootstrap estimator. Here, I have re-used the code in `r_code_lecture_6.r` to compute the bootstrap error:

```{r}
B <- 1000
bootstrap.err.b <- matrix(NA, nrow = B, ncol = 1)

# helper that computes the test error of a fitted model
computeError <- function(model, newdata)
  mean((newdata[, "lpsa"] - predict(model, newdata = newdata))^2)

for (b in 1:B) {
  set.seed(b) # set seed, useful for replication
  index <- sample(N, replace = TRUE)
  # the selected observations form the training set
  temp.train.data <- dfs$train[index, ]
  # the rest form the test set
  temp.test.data <- dfs$train[-index, ]

  fit <- lm(lpsa ~ ., data = temp.train.data)

  # save the error in the matrix initialized above
  bootstrap.err.b[b] <- computeError(fit, temp.test.data)
  if (b %% 100 == 0) print(b)
}

# compute the average error
err$bootstrap <- mean(bootstrap.err.b)

# 0.632 bootstrap
err$bootstrap_632 <- 0.368 * err$train + 0.632 * err$bootstrap
```

#### Compare the error estimates:

```{r}
err
```

The error estimates differ considerably, partly because they target different quantities. Normally, we are interested in the test error, that is, the expected prediction error ***conditional*** on the particular training data we have ($Err_\mathcal{T}$ in the notation of Hastie et al.). Both the cross-validation and the bootstrap estimates are typically closer to the **unconditional** error, the expected test error $Err = \mathrm{E}[Err_\mathcal{T}]$; see ch. 7.12 in the ELS book for a discussion. The AIC criterion again estimates another quantity, namely the **in-sample** error ($Err_{in}$), while the BIC criterion has a Bayesian motivation: finding the most likely model given the data. In other words, a direct comparison of these numbers is difficult.
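A rough way to make the AIC-type estimate comparable in scale is the Cp-style estimate of the in-sample error, $\overline{\mathrm{err}} + 2\,(d/N)\,\hat\sigma_\varepsilon^2$ (cf. ELS ch. 7.5), which lives on the same squared-error scale as the training, CV, and bootstrap numbers. Here $d$ counts only the mean parameters (intercept plus slopes), unlike the $d$ used for AIC/BIC above. The sketch below is not part of the original solution; it refits the full model because `fit` was overwritten inside the bootstrap loop.

```{r}
# Cp-style estimate of the in-sample error, on the MSE scale:
#   err_train + 2 * d/N * sigma2_hat
fit_full <- lm(lpsa ~ ., data = dfs$train) # refit (fit was overwritten above)
d_mean <- length(coefficients(fit_full)) # intercept + 8 slopes
sigma2_hat <- sum(residuals(fit_full)^2) / (N - d_mean) # unbiased residual variance
err$cp <- err$train + 2 * d_mean / N * sigma2_hat
err$cp
```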