# STK2100
# Exercises for January 30th 2023.
# Lars H. B. Olsen

# READ:
# On a Mac, press Cmd + Shift + O to get an overview of the document.

# Exercise 2.7 in 'Elements of Statistical Learning' -------------------------------------------------------------------------
# See pages 8-9 in /studier/emner/matnat/math/STK2100/v23/Exercise%20solutions/esl_ch2ex.pdf.

# Exercise 2.1 in 'Introduction to Statistical Learning in R' ----------------------------------------------------------------
# For each of parts (a) through (d), indicate whether we would generally expect
# the performance of a flexible statistical learning method to be better or
# worse than an inflexible method. Justify your answer.

# (a)
# Q: The sample size n is extremely large, and the number of predictors p is small.
# A: Better - with an extremely large sample size, a flexible approach can fit
#    the data closely without overfitting, so it would generally obtain a
#    better fit than an inflexible approach.

# (b)
# Q: The number of predictors p is extremely large, and the number of observations n is small.
# A: Worse - a flexible method would overfit the small number of observations.

# (c)
# Q: The relationship between the predictors and response is highly non-linear.
# A: Better - with more degrees of freedom, a flexible model can capture the
#    non-linear relationship and obtain a better fit.

# (d)
# Q: The variance of the error terms, i.e. \sigma^2 = Var(\epsilon), is extremely high.
# A: Worse - a flexible method would fit the noise in the error terms and
#    thereby increase its variance.

# Exercise 2.2 in 'Introduction to Statistical Learning in R' ----------------------------------------------------------------
# Explain whether each scenario is a classification or regression problem,
# and indicate whether we are most interested in inference or prediction.
# Finally, provide n and p.

# (a)
# Q: We collect a set of data on the top 500 firms in the US. For each firm we
#    record profit, number of employees, industry, and the CEO salary.
#    We are interested in understanding which factors affect CEO salary.
# A: Regression and inference. The output, CEO salary, is quantitative, and we
#    want to understand which of the firm's features affect it.
# n - 500 firms in the US
# p - 3 (profit, number of employees, industry)

# (b)
# Q: We are considering launching a new product and wish to know whether it will
#    be a success or a failure. We collect data on 20 similar products that were
#    previously launched. For each product we have recorded whether it was a
#    success or failure, price charged for the product, marketing budget,
#    competition price, and ten other variables.
# A: Classification and prediction. We predict whether the new product will be
#    a success or a failure.
# n - 20 similar products previously launched
# p - 13 (price charged, marketing budget, competition price, ten other variables)

# (c)
# Q: We are interested in predicting the % change in the USD/Euro exchange rate
#    in relation to the weekly changes in the world stock markets. Hence we
#    collect weekly data for all of 2012. For each week we record the % change
#    in the USD/Euro, the % change in the US market, the % change in the
#    British market, and the % change in the German market.
# A: Regression and prediction. The output, the % change in the USD/Euro rate,
#    is quantitative.
# n - 52 (weeks of 2012)
# p - 3 (% change in US market, % change in British market, % change in German market)

# Exercise 2.8 in 'Introduction to Statistical Learning in R' ----------------------------------------------------------------

# 8. (a)
# Download the package that contains all the data sets used in
# 'An Introduction to Statistical Learning: with Applications in R'.
# install.packages("ISLR")
library("ISLR")
data(College)
# attach(College) is not needed here, since we refer to columns via College$ below.

# Or we can download the data directly from the webpage:
# College <- read.csv("https://statlearning.com/s/College.csv", header = TRUE)
# summary(College)

# Or load it from the computer (if you have it in the same folder as this file):
# College = read.csv(paste(getwd(), "/College.csv", sep = ""))
# summary(College)

# 8. (b)
# If we used the first method (the ISLR package) to get the College data,
# then we can skip this step.
# View(College)
# fix(College)
# rownames(College) = College[,1]
# College = College[,-1]
# View(College)
# fix(College)

# 8. (c)
# i.
head(College)
summary(College)

# ii.
# To make pairs() work, I needed to manually convert some of the features to
# factors. This was not necessary before: since R 4.0.0, read.csv() no longer
# converts strings to factors by default.
# Not needed if the data is loaded via the ISLR package.
College$Private = factor(College$Private)
pairs(College[, 1:10])

# iii.
plot(College$Private, College$Outstate)

# iv.
Elite = rep("No", nrow(College))
Elite[College$Top10perc > 50] = "Yes"
Elite = as.factor(Elite)
College = data.frame(College, Elite)
summary(College$Elite)
plot(College$Elite, College$Outstate)

# v.
par(mfrow = c(2, 2))
hist(College$Apps)
hist(College$perc.alumni, col = 2)
hist(College$S.F.Ratio, col = 3, breaks = 10)
hist(College$Expend, breaks = 100)

# vi.
par(mfrow = c(1, 1))
# High tuition correlates with a high graduation rate.
plot(College$Outstate, College$Grad.Rate)
# Colleges with a low acceptance rate tend to have a low student-to-faculty ratio.
plot(College$Accept / College$Apps, College$S.F.Ratio)
# Colleges with the most students from the top 10% of their high-school class
# do not necessarily have the highest graduation rate. Also, a rate > 100 is
# erroneous!
plot(College$Top10perc, College$Grad.Rate)
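
# Extra sanity check (not part of the exercise): graduation rates above 100%
# are data errors, so we can cap them at 100 and quantify the tuition vs.
# graduation-rate relationship from vi. with a correlation. The capped column
# name below is just illustrative.
College$Grad.Rate.Capped = pmin(College$Grad.Rate, 100)
summary(College$Grad.Rate.Capped)
cor(College$Outstate, College$Grad.Rate.Capped)
plot(College$Outstate, College$Grad.Rate.Capped,
     xlab = "Out-of-state tuition", ylab = "Graduation rate (capped at 100)")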
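
# The trade-offs in Exercise 2.1 above can also be illustrated with a small
# simulation (a sketch with arbitrary sample size and noise level, not part of
# the exercise): fit an inflexible model (lm) and a flexible one
# (smooth.spline) to noisy non-linear data and compare their test errors.
set.seed(1)
n = 500
x = runif(n, 0, 2 * pi)
y = sin(x) + rnorm(n, sd = 0.3)  # highly non-linear truth, cf. (c)
train = sample(n, n / 2)
fit.lin = lm(y ~ x, subset = train)           # inflexible
fit.flex = smooth.spline(x[train], y[train])  # flexible
mse = function(pred, obs) mean((obs - pred)^2)
mse(predict(fit.lin, data.frame(x = x[-train])), y[-train])  # linear test MSE
mse(predict(fit.flex, x[-train])$y, y[-train])               # spline test MSE
# The flexible fit should give the lower test MSE here; raising sd in rnorm()
# above inflates both errors and shrinks the flexible method's advantage,
# in line with the answer to (d).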