--- title: "Exercises day 6" author: "Alfonso Diz-Lois" format: pdf editor: visual --- ### Exercise 15 At the lectures we looked an example concerning the opinion poll from February 2017 (cf. slide 2 from the lectures) We will in this exercise consider this example further. Of the n=935 persons who were interviewed by Norstat, y=sum(yi)=309 would have voted Ap #### a The following calculations reproduce the result from the lectures (cf. slide 4) ```{r} n=935 y=309 p=y/n se=sqrt(p*(1-p)/n) margin=1.96*se lower=p-margin upper=p+margin cbind(p,margin,lower,upper) ``` Do the calculations and check that you get the result from the lectures. #### b In the opinion poll, 122 of the persons interviewed would have voted Fremskrittspartiet (FrP) and 80 would have voted Senterpartiet (Sp). Repeat the calculations above for Fremskrittspartiet and Senterpartiet. How is the "margin of error" for these parties compared to the "margin of error" for Ap (cf. slide 4)? ```{r} #FRP y=122 p=y/n se=sqrt(p*(1-p)/n) margin=1.96*se lower=p-margin upper=p+margin c("FRP",round(c(p,margin,lower,upper),3)) #SP y=80 p=y/n se=sqrt(p*(1-p)/n) margin=1.96*se lower=p-margin upper=p+margin c("SP",round(c(p,margin,lower,upper),3)) ``` ```{r} p=seq(0.001,0.99,length.out=100) plot(p,sqrt(p*(1-p)/n),xlab="p",ylab="SE") ``` ### Exercise 16 confidence interval and tests for two proportions In this exercise we will use the results from the opinion polls from Norstat from February and March 2017 to investigate the change in the support for some political parties. In February 2017 Norstat asked $n_1=935$ individuals which party they would support if there had been election to the parliament tomorrow. Of these $y_1=230$ would have voted Høyre. One month later, in March, $n_2=927$ persons were interviewed and $y_2=208$ of these would have voted Høyre. #### a We start out by estimating the change in the support for Høyre with a 95 % confidence interval (cf. slide 6) Try to program such an interval yourself in R. A suggested solution is given at the bottom of this page. Comment on the results. ```{r} n1=935 y1=230 p1=y1/n1 se1=sqrt(p1*(1-p1)/n1) n2=927 y2=208 p2=y2/n2 se2=sqrt(p2*(1-p2)/n2) se=sqrt(se1^2+se2^2) change=p1-p2 margin=1.96*se lower=change-margin upper=change+margin round(c(change,margin,lower,upper),3) ``` #### b We then test the null hypothesis that the support for Høyre has not changed from February to March (cf. slide 8) ```{r} p=(y1+y2)/(n1+n2) se0=sqrt(p*(1-p)/n1+p*(1-p)/n2) z=(p1-p2)/se0 pval=2*(1-pnorm(abs(z))) round(c(z,pval),3) ``` Perform these commands and comment on the results. Is the null hypothesis rejected or not? How does this relate to the confidence interval computed earlier? #### c R has a command for comparing two proportions ```{r} hoyre=matrix(c(y1,y2,n1-y1,n2-y2),nrow=2) # give the data for H�yre in a 2x2 table (cf. slide 10) colnames(hoyre)<-c("y","rest") hoyre ``` ```{r} prop.test(hoyre,correct=F) ``` Perform these commands and check that the results agree with those obtained earlier. The prop.test-command give a chi squared statistic, not a z-value as we computed earlier. What is the relation between the two? $\chi^2 = z^2$ ```{r} a=rnorm(100000)^2 b=rnorm(100000)^2 c=rnorm(100000)^2 plot(density(a+b+c),col="blue") lines(density(rchisq(1000000,df=3)),col="green",lty=2) legend(legend=c("a+b+c","Chi-sq"),col=c("blue","green"),lty=c(1,2),"topright") ``` ### Exercise 17 In this exercise we will use data from the Western Collaborative Group Study (WCGS), a large epidemiological study designed to study risk factors for coronary heart disease (CHD). More than 3000 men were recruited to the study, and a number of (potential) risk factors were recorded at entry. The men were then followed for about ten years and it was recorded if they developed CHD or not. The WCGS data are used in many of the examples in Section 3.4 and Chapter 6 of the course text book byVittinghoff et al. ```{r} #Read data. WCGS.data = read.table(file="http://www.uio.no/studier/emner/matnat/math/STK4900/data/wcgs.txt", header=TRUE, sep="\t", na.strings=".") head(WCGS.data) ``` #### a We first look at the effect of smoking (which is a binary covariate: 0=non-smoker, 1=smoker). We start out by making a 2x2 table for the development of CHD or not for smokers and nonsmokers: ```{r} table(WCGS.data$smoke,WCGS.data$chd69) ``` From the table you will see that there were 1502 smokers and that 159 of these developed CHD. Of the 1652 non-smokers, 98 developed CHD We then estimate the proportion who developed CHD among smokers and non-smokers. We also estimate the relative risk and the odds ratio (cf. slides 17 and 19): ```{r} p1=159/1502 #Smokers p0=98/1652 #Non smokers RR=p1/p0 OR=(p1/(1-p1))/(p0/(1-p0)) round(c(p1,p0,RR,OR),4) ``` #### b ```{r} fit.smoke=glm(chd69~smoke,data=WCGS.data,family=binomial) summary(fit.smoke) ``` From a fitted logistic model we may estimate the odds ratio corresponding to a one unit's increase in a covariate (cf. slide 26 ). Use this result to compute the odds ratio for CHD for a smoker compared to a non-smoker. Compare with the result from question a. According to the definition it should be: ```{r} exp(fit.smoke$coefficients[2]) ``` Which is basically the same result we got before. ### c We then use logistic regression to study the effect of age (at entry to the study) for the risk of developing CHD: ```{r} fit.age=glm(chd69~age,data=WCGS.data,family=binomial) summary(fit.age) ``` From a fitted logistic model we may estimate the odds ratio corresponding to a given increase in a covariate (cf. slide 26). Compute the odds ratio for CHD for a one-year and a ten-year increase in age. Check that you get the same results as on pages 162-163 in the text book by Vittinghoff et al. ```{r} print(paste0("1 year: ",round(exp(1*fit.age$coefficients[2]),3))) print(paste0("10 years: ",round(exp(10*fit.age$coefficients[2]),3))) ```