--- title: "Exercises 4.28, 4.29 and 4.30" author: "Per August Jarval Moen" date: "2024" output: pdf_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Exercise 4.28 We begin by loading the data. ```{r } data = read.table("https://users.stat.ufl.edu/~aa/glm/data/Houses.dat",header=TRUE) data = subset(data, select = -case) data$new = factor(data$new) ``` Here we have removed the column \textit{case} as it is simply the observation number. We now summerize the data with some descriptive statistics and plots. First, the descriptive statistics: ```{r } summary(data) ``` Then some histograms: ```{r } library(ggplot2) library(tidyr) ggplot(data, aes(x = price)) + geom_histogram(binwidth = 20, color = "black") + labs(title = "Histogram of House Prices", x = "Price", y = "Frequency") + theme(plot.title = element_text(hjust = 0.5)) ``` ```{r } library(ggplot2) library(tidyr) ggplot(data, aes(x = taxes)) + geom_histogram(binwidth = 200, color = "black") + labs(title = "Histogram of House taxes", x = "Taxes", y = "Frequency") + theme(plot.title = element_text(hjust = 0.5)) ``` ```{r } library(ggplot2) library(tidyr) ggplot(data, aes(x = beds)) + geom_histogram(binwidth = 1, color = "black") + labs(title = "Histogram of number of bedrooms", x = "Bedrooms", y = "Frequency") + theme(plot.title = element_text(hjust = 0.5)) ``` ```{r } library(ggplot2) library(tidyr) ggplot(data, aes(x = baths)) + geom_histogram(binwidth = 1, color = "black") + labs(title = "Histogram of number of bathrooms", x = "Baths", y = "Frequency") + theme(plot.title = element_text(hjust = 0.5)) ``` ```{r } library(ggplot2) library(tidyr) ggplot(data, aes(x = size)) + geom_histogram(binwidth = 200, color = "black") + labs(title = "Histogram of size", x = "Size", y = "Frequency") + theme(plot.title = element_text(hjust = 0.5)) ``` Finally, we plot a scatter plot between all variables. Here we plot new and old houses separately (new houses in blue). ```{r} library(GGally) ggpairs(data, aes(colour = new, alpha = 0.4)) ``` Now, let's fit a linear model with \textbf{price} as response variable. We will use forward selection as variable selection procedure. There are many ways to to this. One way is using the library \textbf{olsrr}, where the function \textbf{ols\_step\_best\_subset} is made specifically for linear regression models: ```{r} library(olsrr) fullmodel = lm(price ~ ., data = data) ols_step_best_subset(fullmodel) ``` Here we can choose among several selection criteria, including AIC and adjusted R squared. In this exercise we choose according to AIC, which can also be done as follows: ```{r} intercept_only = lm(price ~1, data = data) AICmodel = step(intercept_only, direction='forward', scope=formula(fullmodel), trace=1) AICmodel ``` Selecting a model according to AIC, we choose the model where \textbf{taxes,new, size} are used as covariates. If we were to choose according to the adjusted R squared, we would have chosen \textbf{taxes,beds, new, size} as covariates (from the output of \textbf{ols\_step\_best\_subset}). Let's take a closer look at the model chosen using AIC: ```{r} AICmodel = lm(price~ taxes + new + size, data = data) summary(AICmodel) ``` We see that all of the three chosen covariates are statistically significant at level 1%. Two of the chosen covariates, \textbf{new} and \textbf{size} are reasonable variables to include in the model, as these likely are related to price in a causal fashion (i.e. an increase in size \textit{causes} an increase in the price). 
On the other hand, \textbf{taxes} is also chosen as a covariate in the model. It is less conceivable that an increase in taxes \textit{causes} an increase in the price of the house (rather, it is reasonable to assume that an increase in price also increases the taxes). We interpret the estimated coefficients as follows:

- Keeping the other covariates fixed, a unit increase in taxes is associated with an estimated $0.037$ unit increase in price on average.
- Keeping the other covariates fixed, a unit increase in size is associated with an estimated $0.062$ unit increase in price on average.
- New houses are estimated to cost $46.37$ units more than old houses on average, for fixed values of taxes and size.

Lastly, we check whether our results depend on any influential observations. We identify influential observations by considering Cook's distance, both for the full model and for the model chosen by AIC:

```{r}
cooksdistances_full = cooks.distance(fullmodel)
plot(cooksdistances_full)
which(cooksdistances_full > 1)

cooksdistances = cooks.distance(AICmodel)
plot(cooksdistances)
which(cooksdistances > 0.8)
```

As in the textbook, we see that observation $64$ is influential both in the full model and in the model chosen by AIC. Let's choose a new model, again by forward stepwise selection with AIC, when we exclude observation $64$:

```{r}
data_without_64 = data[-c(64), ]
fullmodel_without_64 = lm(price ~ ., data = data_without_64)
intercept_only_without_64 = lm(price ~ 1, data = data_without_64)
AICmodel_without_64 = step(intercept_only_without_64, direction = 'forward', scope = formula(fullmodel_without_64), trace = 1)
AICmodel_without_64
```

We see that a different model is chosen when observation $64$ is omitted. Without this observation, forward selection with AIC chooses \textbf{size}, \textbf{taxes}, \textbf{beds} and \textbf{new} as covariates.

# Exercise 4.29

## a

Now let's choose a model using backward selection with variable significance as the selection criterion. For the initial model, we include all pairwise interactions between the covariates. For simplicity, we use the function \textbf{step}. This function drops, at each iteration, the term whose removal most improves the AIC (the \texttt{test = "F"} argument merely adds F-tests to the output), so the model it returns is chosen according to AIC rather than significance. We start by fitting the full model with pairwise interactions, and then apply \textbf{step}:

```{r}
fullmodel = lm(price ~ (.)^2, data = data)
AICmodel = step(fullmodel, trace = 1, direction = "backward", test = "F")
```

Since we only want to keep the significant variables, we use the output above and remove the single non-significant interaction term still left in the final model of the step-wise procedure. Note that \textbf{step} does not discard any main effects because it respects the model hierarchy: a main effect is never dropped while an interaction involving it remains in the model. We therefore manually remove the non-significant main effects as well.

```{r}
chosenmodel = lm(price ~ new + size + taxes:baths + taxes:new + baths:size + new:size, data = data)
summary(chosenmodel)
```

The interpretation of this model is somewhat more difficult than that of the model in Exercise 4.28. We do not interpret every parameter, but only consider some (a small numerical check of the size interpretation is sketched below):

- According to the chosen model, and considering only the \textbf{taxes:new} interaction, a unit increase in taxes is associated with an estimated $0.1$ unit decrease in price for new houses and no change for old houses; the \textbf{taxes:baths} interaction adds a further contribution that depends on the number of bathrooms.
- A unit increase in size is associated with an estimated $0.085$ unit increase in price for old houses and an estimated $0.085 + 0.2 = 0.285$ unit increase in price for new houses.
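
To make the size interpretation concrete, here is a minimal sketch that predicts prices for pairs of otherwise identical hypothetical houses differing by one unit of size. The covariate values are chosen purely for illustration, with baths set to $0$ so that the \textbf{baths:size} interaction does not enter the comparison:

```{r}
# Hypothetical houses, identical within each pair except for a one-unit difference in size.
# Setting baths = 0 isolates the size main effect and the new:size interaction;
# taxes = 3000 is an arbitrary illustration value (it cancels within each pair).
illustration = data.frame(
  taxes = rep(3000, 4),
  baths = rep(0, 4),
  new   = factor(c(0, 0, 1, 1), levels = levels(data$new)),
  size  = c(1500, 1501, 1500, 1501)
)
preds = predict(chosenmodel, newdata = illustration)

# Implied price increase per extra unit of size, for old and new houses respectively
c(old = unname(preds[2] - preds[1]), new = unname(preds[4] - preds[3]))
```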
## b

Let us compare the chosen model from (a) with a simple model with \textbf{size}, \textbf{new} and \textbf{taxes} as covariates.

```{r}
simplemodel = lm(price ~ size + new + taxes, data = data)
summary(simplemodel)
```

The adjusted $R^2$ for the simple model is $0.783$, while the adjusted $R^2$ for the chosen model is $0.858$. In this sense, the chosen model is better than the simple model. The AIC for the simple model is `r round(AIC(simplemodel), digits = 2)`, while the AIC for the chosen model is `r round(AIC(chosenmodel), digits = 2)`. Also in this sense, the chosen model is preferred over the simple model. Interestingly, the chosen model captures a large effect of \textbf{size} on the price of new houses.

## c

This can be done in exactly the same fashion as in Exercise 4.28.

# Exercise 4.30

We analyze the same data, this time assuming that the response has a gamma distribution.

## a

First we consider the identity link function. We begin by fitting the full model:

```{r}
gamma_id = glm(price ~ ., data = data, family = Gamma(link = identity))
summary(gamma_id)
```

Then we do backward selection with AIC as the criterion:

```{r}
gamma_id_AIC = step(gamma_id)
summary(gamma_id_AIC)
```

The covariates chosen by the backward selection are \textbf{taxes}, \textbf{beds}, \textbf{baths} and \textbf{size}. Since the link function is the identity, the interpretations here are the same as for the linear regression model. Note, however, that the gamma distribution only has support on the positive reals; therefore, the linear predictor can never be allowed to become negative.

## b

Now let us consider the log link. We first fit the full model:

```{r}
gamma_log = glm(price ~ ., data = data, family = Gamma(link = log))
summary(gamma_log)
```

Then we do backward selection with AIC as the criterion:

```{r}
gamma_log_AIC = step(gamma_log)
summary(gamma_log_AIC)
```

Since the link function is now the natural logarithm, the interpretation of the estimated coefficients is different:

- The estimated expected price of an old house with taxes $0$ and size $0$ is $\exp(\widehat{\beta}_0) \approx 57$.
- A unit increase in taxes is associated with an estimated $0.02$\% increase in price, for fixed values of new and size.
- A unit increase in size is associated with an estimated $0.027$\% increase in price, for fixed values of new and taxes.
- A new house has a price estimated to be $\exp(0.192) \approx 1.21$ times that of an old house with the same size and taxes.
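
Since the effects on the log-link scale are multiplicative, a convenient way to read them off is to exponentiate the estimated coefficients. A minimal sketch (using Wald-type intervals from \texttt{confint.default} for simplicity):

```{r}
# Multiplicative effect on the expected price per unit increase in each covariate
exp(coef(gamma_log_AIC))

# Approximate 95% Wald-type confidence intervals on the multiplicative scale
exp(confint.default(gamma_log_AIC))
```

For example, the entry for \textbf{new} should reproduce the factor of roughly $1.21$ noted above.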