# STK2100
# Exercises for January 30th 2023.
# Lars H. B. Olsen

# READ:
# On a Mac, press Cmd + Shift + O to get an overview of the document.

# Exercise 2.7 in 'Elements of Statistical Learning' -------------------------------------------------------------------------
# See pages 8-9 in /studier/emner/matnat/math/STK2100/v23/Exercise%20solutions/esl_ch2ex.pdf.

# Exercise 2.1 in 'Introduction to Statistical Learning in R' ----------------------------------------------------------------
# For each of parts (a) through (d), indicate whether we would generally expect
# the performance of a flexible statistical learning method to be better or
# worse than an inflexible method. Justify your answer.

# (a)
# Q: The sample size n is extremely large, and the number of predictors p is small.
# A: Better - with an extremely large sample size, a flexible approach can fit
#    the data closely without overfitting, so it would generally obtain a
#    better fit than an inflexible approach.

# (b)
# Q: The number of predictors p is extremely large, and the number of observations n is small.
# A: Worse - a flexible method would overfit the small number of observations.

# (c)
# Q: The relationship between the predictors and response is highly non-linear.
# A: Better - with more degrees of freedom, a flexible model can capture the
#    non-linear relationship and obtain a better fit.

# (d)
# Q: The variance of the error terms, i.e. \sigma^2 = Var(\epsilon), is extremely high.
# A: Worse - a flexible method would fit the noise in the error terms and
#    thereby increase its variance.

# Exercise 2.2 in 'Introduction to Statistical Learning in R' ----------------------------------------------------------------
# Explain whether each scenario is a classification or regression problem,
# and indicate whether we are most interested in inference or prediction.
# Finally, provide n and p.

# (a)
# Q: We collect a set of data on the top 500 firms in the US. For each firm we
#    record profit, number of employees, industry, and the CEO salary.
#    We are interested in understanding which factors affect CEO salary.
# A: Regression and inference. The output, CEO salary, is quantitative, and we
#    want to understand which of the firm's features affect it.
# n - 500 firms in the US
# p - 3 (profit, number of employees, industry)

# (b)
# Q: We are considering launching a new product and wish to know whether it will
#    be a success or a failure. We collect data on 20 similar products that were
#    previously launched. For each product we have recorded whether it was a
#    success or failure, price charged for the product, marketing budget,
#    competition price, and ten other variables.
# A: Classification and prediction. We predict whether the new product will be
#    a success or a failure.
# n - 20 similar products previously launched
# p - 13 (price charged, marketing budget, competition price, ten other variables)

# (c)
# Q: We are interested in predicting the % change in the USD/Euro exchange rate
#    in relation to the weekly changes in the world stock markets. Hence we
#    collect weekly data for all of 2012. For each week we record the % change
#    in the USD/Euro, the % change in the US market, the % change in the
#    British market, and the % change in the German market.
# A: Regression and prediction. The output, the % change in the USD/Euro rate,
#    is quantitative.
# n - 52 (weeks of 2012)
# p - 3 (% change in US market, % change in British market, % change in German market)

# Exercise 2.8 in 'Introduction to Statistical Learning in R' ----------------------------------------------------------------

# 8. (a)
# Download the package that contains all the data sets used in
# 'An Introduction to Statistical Learning: with Applications in R'.
# install.packages("ISLR")
library("ISLR")
data(College)
# attach(College) is not needed here, since we refer to columns via College$ below.

# Or we can download the data directly from the webpage:
# College <- read.csv("https://statlearning.com/s/College.csv", header = TRUE)
# summary(College)

# Or load it from the computer (if you have it in the same folder as this file):
# College = read.csv(paste(getwd(), "/College.csv", sep = ""))
# summary(College)

# 8. (b)
# If we used the first method (the ISLR package) to get the College data,
# then we can skip this step.
# View(College)
# fix(College)
# rownames(College) = College[,1]
# College = College[,-1]
# View(College)
# fix(College)

# 8. (c)
# i.
head(College)
summary(College)

# ii.
# To make pairs() work, I needed to manually convert some of the features to
# factors. This was not necessary before: since R 4.0.0, read.csv() no longer
# converts strings to factors by default.
# Not needed if the data is loaded via the ISLR package.
College$Private = factor(College$Private)
pairs(College[, 1:10])

# iii.
plot(College$Private, College$Outstate)

# iv.
Elite = rep("No", nrow(College))
Elite[College$Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
College = data.frame(College, Elite)
summary(College$Elite)
plot(College$Elite, College$Outstate)

# v.
par(mfrow = c(2, 2))
hist(College$Apps)
hist(College$perc.alumni, col = 2)
hist(College$S.F.Ratio, col = 3, breaks = 10)
hist(College$Expend, breaks = 100)

# vi.
par(mfrow = c(1, 1))
# High tuition correlates with a high graduation rate.
plot(College$Outstate, College$Grad.Rate)
# Colleges with a low acceptance rate tend to have a low student-to-faculty ratio.
plot(College$Accept / College$Apps, College$S.F.Ratio)
# Colleges with the most students from the top 10% of their high-school class
# do not necessarily have the highest graduation rate. Also, a rate > 100 is
# erroneous!
plot(College$Top10perc, College$Grad.Rate)
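
# Extra sanity check (not part of the exercise): graduation rates above 100%
# are data errors, so we can cap them at 100 and quantify the tuition vs.
# graduation-rate relationship from vi. with a correlation. The capped column
# name below is just illustrative.
College$Grad.Rate.Capped = pmin(College$Grad.Rate, 100)
summary(College$Grad.Rate.Capped)
cor(College$Outstate, College$Grad.Rate.Capped)
plot(College$Outstate, College$Grad.Rate.Capped,
     xlab = "Out-of-state tuition", ylab = "Graduation rate (capped at 100)")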
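
# The trade-offs in Exercise 2.1 above can also be illustrated with a small
# simulation (a sketch with arbitrary sample size and noise level, not part of
# the exercise): fit an inflexible model (lm) and a flexible one
# (smooth.spline) to noisy non-linear data and compare their test errors.
set.seed(1)
n = 500
x = runif(n, 0, 2 * pi)
y = sin(x) + rnorm(n, sd = 0.3)  # highly non-linear truth, cf. (c)
train = sample(n, n / 2)
fit.lin = lm(y ~ x, subset = train)           # inflexible
fit.flex = smooth.spline(x[train], y[train])  # flexible
mse = function(pred, obs) mean((obs - pred)^2)
mse(predict(fit.lin, data.frame(x = x[-train])), y[-train])  # linear test MSE
mse(predict(fit.flex, x[-train])$y, y[-train])               # spline test MSE
# The flexible fit should give the lower test MSE here; raising sd in rnorm()
# above inflates both errors and shrinks the flexible method's advantage,
# in line with the answer to (d).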