--- title: "Day 1 exercises" author: "Alfonso Diz-Lois" date: "2025-02-11" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Exercise 1: one-sample t-test and confidence interval Insert the data into a vector and compare the basic statistics: ```{r} age=c(249, 254, 243, 268, 253, 269, 287, 241, 273, 306, 303, 280, 260, 256, 278, 344, 304, 283, 310) show(mean(age)) show(median(age)) show(sd(age)) ``` It is always good practice to take a close look at the data. Sometimes, a picture is worth a thousand words (but not always): ```{r} hist(age) ``` Add some perspective with respect to the previous statistics: ```{r} hist(age) abline(v=mean(age),col="blue",lty=2,lwd=1.5) abline(v=median(age),col="green",lty=2,lwd=1.5) abline(v=mean(age)+sd(age),col="black",lty=3,lwd=1.5) abline(v=mean(age)-sd(age),col="black",lty=3,lwd=1.5) ``` Plot the empirical distribution function: ```{r} plot(ecdf(age),verticals=T, do.points=F,xlab="age") ``` Estimate the quartiles: ```{r} quantile(age) show(median(age)) ``` Boxplot ```{r} boxplot(age,main="Age boxplot") ``` Let's visualize the quartiles: ```{r} boxplot(age,main="Age boxplot") abline(h=quantile(age,probs=0.25),lty=4) abline(h=quantile(age,probs=0.75),lty=4) ``` Can we have points that go beyond the whiskers? According to boxplot they are identified as a maximum of 1.5 the interquartile range (by default), constrained by the range of the data. ```{r} #age_doped<-c(age,380) age_doped<-age #either: intqr=quantile(age_doped,probs=0.75)-quantile(age_doped,probs=0.25) boxplot(age_doped) abline(h=max(max(age),quantile(age_doped,probs=0.75)+1*intqr),lty=2,col=rgb(red =0,blue=1,green = 0,alpha=0.4)) abline(h=max(min(age_doped),quantile(age_doped,probs=0.25)-1.5*intqr),lty=2,col=rgb(red =0,blue=1,green = 0,alpha=0.4)) ``` ## Tests on the mean A t-student distribution with 18 degrees of freedom looks like this: ```{r} tsamples=rt(10000000,18) plot(density(tsamples),main="T-student; 18df") abline(v=quantile(tsamples,probs=c(0.025,0.975)),col="blue",lty=3) show(quantile(tsamples,probs=c(0.025,0.975))) show(qt(0.025,18)) show(qt(0.975,18)) ``` ```{r} show(mean(age) - qt(0.975,18)*(sd(age)/sqrt(19))) show(mean(age) + qt(0.975,18)*(sd(age)/sqrt(19))) ``` Similarly, using bootstrapping (note that the quantile function interpolates the values, ideally, one should sort the samples and estimate the quantile manually as suggested in the slides): ```{r} means<-c() for (j in 1:100000){ means<-c(means,mean(sample(age,size=length(age),replace = T))) } print(quantile(means,probs=c(0.025,0.975))) ``` Null Hipothesis: $ H_0:\mu = 265$ : Can we reject? According to the values we got we already know we cannot. Let's check: ```{r} tstat=(mean(age)-265)/(sd(age)/sqrt(19)) show(tstat) # This shows where we are in the distribution ``` ```{r} #Turn this into a p-value, or how likely we might be able to get this result at random: plot(density(tsamples),main="T-student; 18df") abline(v=quantile(tsamples,probs=c(0.025,0.975)),col="blue",lty=3) abline(v=tstat,col="red",lty=4) ``` We can see that the value is slightly below the upper bound, and therefore within the range at the specified level of confidence $1-\alpha$ ```{r} 1-pt(tstat,18) ``` What is this command doing exactly? 
Null hypothesis $H_0: \mu = 265$: can we reject it? Given the confidence interval we just computed, we already know we cannot. Let's check:

```{r}
tstat=(mean(age)-265)/(sd(age)/sqrt(19))
show(tstat) # This shows where we are in the distribution
```

```{r}
# Turn this into a p-value, i.e. how likely it is to get a value at least this extreme at random:
plot(density(tsamples),main="T-student; 18df")
abline(v=quantile(tsamples,probs=c(0.025,0.975)),col="blue",lty=3)
abline(v=tstat,col="red",lty=4)
```

We can see that the statistic lies slightly below the upper critical value, and therefore within the acceptance region at the specified confidence level $1-\alpha$.

```{r}
1-pt(tstat,18)
```

What is this command doing exactly? It checks the upper tail of the cumulative distribution function:

```{r}
library(latex2exp)
plot(seq(-5,5,0.01),pt(seq(-5,5,0.01),df=18),xlab=TeX(r"(\frac{(\bar{x}-\mu_0)}{\frac{s}{\sqrt{n}}})"),mgp=c(3,0,0))
abline(v=tstat,col="red",lty=4)
abline(h=pt(tstat,18),col="red",lty=4)
text(-4.5,pt(tstat,18)+.02,paste(round(pt(tstat,18),3)))
```

Alternatively, we can use the built-in R command for the two-sided test:

```{r}
#Two sided
t.test(age,mu=265)
```

And the one-sided version if we want to test whether the mean is larger than 265:

```{r}
t.test(age,alternative="greater",mu=265)
```

In the following code, we generate 1 million samples from the reference distribution (Student's t with 18 df), scale them by the standard error estimated from our data and add the observed mean; in other words, we simulate the sampling distribution of the mean. We then estimate the proportion of samples that fall below the reference value 265. By the symmetry of the t distribution, this equals the one-sided p-value we obtained above.

```{r}
# lower limit of the one-sided 95% confidence interval (green line below)
lb<-mean(age)+qt(0.05,df=18)*sd(age)/sqrt(length(age))
mean_samples<-(rt(1000000,df=18)*sd(age)/sqrt(length(age)))+mean(age)
plot(density(mean_samples))
abline(v=lb,col="green",lty=3)
sum(mean_samples<265)/length(mean_samples)
```

## Exercise 2:

In the lectures we looked at an example on bone mineral density (cf. slide 25). In this exercise we will see how the computations for this example may be done in R.

Start by reading the data into R. This may be done by the commands:

```{r}
cont=c(0.228, 0.207, 0.234, 0.220, 0.217, 0.228, 0.209, 0.221, 0.204, 0.220, 0.203, 0.219, 0.218, 0.245, 0.210)
treat=c(0.250, 0.237, 0.217, 0.206, 0.247, 0.228, 0.245, 0.232, 0.267, 0.261, 0.221, 0.219, 0.232, 0.209, 0.255)
```

Find the means and standard deviations, and check that you get the same results as in the lectures (slide 25):

```{r}
paste(mean(cont),mean(treat),sd(cont),sd(treat),sep = " ; ")
```

Use the command `t.test` to compute the confidence interval, t-statistic and P-value:

```{r}
t.test(treat, cont, var.equal=T)
```

Make sure that you understand the output from the command and check that you get the same results as in the lectures (slides 27 and 28).

Optional: Use the formulas given on slide 26 to compute the pooled standard deviation, the standard error of the effect of treatment, and the 95% confidence interval.

```{r}
n1=length(cont)
n2=length(treat)
s1=sd(cont)
s2=sd(treat)
# pooled standard deviation
sp=sqrt(s1^2*(n1-1)/(n1+n2-2)+s2^2*(n2-1)/(n1+n2-2))
# standard error of the effect of treatment
se=sp*sqrt(1/n1+1/n2)
effect=mean(treat)-mean(cont)
print(paste0("Effect of the treatment: ",effect, " ± ",round(qt(0.975,df = n1+n2-2)*se,4)))
```

Optional: Use the formulas given on slide 28 to compute the t-statistic and the P-value.

```{r}
tstat=effect/se # statistic
show(tstat)
# Upper critical value
print(qt(0.975,df=n1+n2-2))
# Corresponding two-sided p-value for our statistic
print(2*(1-pt(tstat,df=n1+n2-2)))
```

## Exercise 3:

Read the data into R and look at the data:

```{r}
speed=scan("http://www.uio.no/studier/emner/matnat/math/STK4900/v17/exer2.dat")
```

### Plot the data by a histogram. Also plot the empirical distribution function and make a box plot. What do the different plots tell you? Are there indications of "outliers" in the data?

```{r,figures-side, fig.show="hold"}
par(mfrow=c(1,3))
hist(speed)
boxplot(speed)
plot(ecdf(speed),verticals=T, do.points=F,xlab="speed")
```

All three plots point at a couple of extreme observations far below the bulk of the data.
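As a quick numerical check beyond the plots (not part of the original exercise), `boxplot.stats` applies the same 1.5 × IQR whisker rule as `boxplot` and returns the observations drawn beyond the whiskers:

```{r}
# observations flagged beyond the whiskers by the 1.5 * IQR rule
boxplot.stats(speed)$out
```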
### Compute the (empirical) mean and median. These are both measures of location. What do the two measures of location tell you?

```{r}
print(paste(round(mean(speed),2),median(speed),sep=" ; "))
```

The negative outliers pull the mean down; the median is more robust to them.

### Compute the (empirical) standard deviation and the interquartile range (i.e. the difference between the third and the first quartile). These are both measures of spread. What do the two measures of spread tell you?

```{r}
show(as.numeric(quantile(speed,probs=0.75)-quantile(speed,probs=0.25)))
#show(IQR(speed)) # built-in equivalent
show(sd(speed))
```

As with the mean, the standard deviation is inflated by the outliers, whereas the interquartile range is hardly affected by them.

### Compute a 95% confidence interval using all data. Also compute a 95% confidence interval without the two "outliers" (cf. question a).

Note that the empirical 2.5% and 97.5% quantiles of the data describe where the observations lie, not the uncertainty of the mean; the confidence interval for the mean is given by `t.test` (or by mean ± t·se, as in Exercise 1):

```{r}
# All data
t.test(speed)$conf.int
# Without the two negative outliers
t.test(speed[speed>0])$conf.int
```

### The true value has subsequently been shown to be 33.02. Comment on the confidence intervals in question d.

```{r}
hist(speed)
abline(v=t.test(speed)$conf.int,lty=4)
abline(v=33.02,col="green",lty=2)
```

The "true" value lies above both confidence intervals. A confidence interval only accounts for random variation, so the discrepancy points to a systematic error in the measurements, which removing the outliers cannot fix.
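As a cross-check, the same intervals can be computed by hand with the mean ± t·se formula used in Exercise 1. `ci_mean` is a small helper introduced here; `t.test` should report the same limits:

```{r}
# helper: t-based confidence interval for the mean of x
ci_mean <- function(x, conf = 0.95) {
  n  <- length(x)
  se <- sd(x)/sqrt(n)
  mean(x) + c(-1, 1)*qt(1 - (1 - conf)/2, df = n - 1)*se
}
ci_mean(speed)            # all data
ci_mean(speed[speed > 0]) # without the two negative outliers
```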