---
title: "Exercises day 9"
author: "Alfonso Diz-Lois"
format: pdf
editor: visual
---

## Exercise 23

The data are from a old study of children with leukemia. The children were treated so that they had no symptoms (one say that they are in remission). They were then given an active treatment ("Drug") to keep the symptoms from coming back or placebo ("Control"). Time until the symptoms came back (called relapse) were measured. For some children the symptoms had not come back at the end of the study. These give rise to censored observations.

```{r}
# Read data.
leukemia.data = read.table(file="http://www.uio.no/studier/emner/matnat/math/STK4900/data/gehan.txt", header=TRUE)
# Take a look at the data.
leukemia.data
# "time" is time in weeks to relapse or censoring
# "cens" is 1 for relapse and 0 for censoring
# "treat" is 1 for "Control" and 2 for "Drug"
```

### 1

Compute Kaplan-Meier estimates for the two groups (without confidence intervals)

```{r}
library(survival)
surv.obj.1 = survfit(Surv(time,cens)~treat, conf.type="none", data=leukemia.data)
summary(surv.obj.1)
# Make sure you understand what the output tells!
# See Slide 12, Lecture 9.
```

```{r}
# Plot the Kaplan-Meier estimates.
plot(surv.obj.1, lty=1:2)
abline(h=0.5, col="red", lty=2)
# Interpret the plots.
# Read from the plots (approximately) what is the median time to relapse for the two groups
# Check the results by giving the command print(surv.obj.1)
```

### 2

Compute Kaplan-Meier estimates for the two groups with confidence intervals (the default confidence interval in R is not a good choice, so we make an explicit choice of the type of confidence interval)

```{r}
surv.obj.2 = survfit(Surv(time,cens)~treat, conf.type="plain", data=leukemia.data)
summary(surv.obj.2)
```

```{r}
plot(surv.obj.2, conf.int=TRUE, lty=1:2)
abline(h=0.5, col="red", lty=2)
```

### 3

Log-rank test for difference between the groups:

```{r}
survdiff(Surv(time,cens)~treat, data=leukemia.data)
```

Estimate the Expected events manually:

```{r}

calculate_expected_events <- function(data) {
  # Sort data by time
  data <- data[order(data$time),]
  
  # Initialize vectors to store the number at risk and the number of events
  n_risk <- nrow(data)
  n_events <- sum(data$cens)
  
  # Initialize vectors to store the expected number of events for each group
  expected_events_group1 <- numeric(n_events)
  expected_events_group2 <- numeric(n_events)
  
  # Iterate over the unique time points
  for (time_point in unique(data$time)) {
    at_risk_total <- sum(data$time >= time_point) # total population at risk at time t
    events_at_time <- sum(data$time == time_point & data$cens == 1) # number of events at time t
    
    # Only proceed if there are events at this time point
    if (events_at_time > 0) {
      at_risk_group1 <- sum(data$time >= time_point & data$treat == 1) # population at risk in group 1 at time t
      at_risk_group2 <- sum(data$time >= time_point & data$treat == 2) # population at risk in group 2 at time t
      expected_events_group1[time_point] <- at_risk_group1 * events_at_time / at_risk_total
      expected_events_group2[time_point] <- at_risk_group2 * events_at_time / at_risk_total
    }
  }
  
  # Return the sum of the expected events for each group
  list(expected_group1 = sum(expected_events_group1),
       expected_group2 = sum(expected_events_group2))
}

# Example usage:
leukemia.expected <- calculate_expected_events(leukemia.data)
print(leukemia.expected)
```

This function calculates the expected number of events for each group sequentially across each time point and then aggregate the result

What does the output tell you?

Both survival functions are significantly different

### 4

Cox regresion with treatment group as covariate:

```{r}
cox.obj = coxph(Surv(time,cens)~factor(treat), data=leukemia.data)
summary(cox.obj)
# Interpret the results!

```

## Exercise 24: Survival Analysis

In the period 1962-77 a total of 205 patients with malignant melanoma were operated at Odense University hospital in Denmark. The patients were followed up until death or censoring at the end of the study at 31/12-1977. The data are available at the course web-page.

As we are interested in the mortality from malignant melanoma, we will in this exercise treat death from other causes as censored observations in the analysis.

```{r}
# Clean up the memory before we start.
rm(list=ls(all=TRUE))

# Read the data into a dataframe:
melanoma.data = read.table("http://www.uio.no/studier/emner/matnat/math/STK4900/data/melanoma.txt", header=TRUE)

# Load the R-library for survival analysis (if you have not already done so):
library(survival)
```

### a

Make Kaplan-Meier plots for each of the two genders, and test the difference using the logrank test. Discuss the results.

```{r}
# Compute Kaplan-Meier estimates for the two groups (males, females) (without confidence intervals).
surv.obj = survfit(Surv(lifetime,status==1)~sex, data=melanoma.data)
# Plot the Kaplan-Meier estimates.
plot(surv.obj, lty=1:2, mark.time=FALSE)
# Log-rank test for difference between the groups:
survdiff(Surv(lifetime,status==1)~sex, data=melanoma.data)

```

EXAMPLE ON HOW TO APPLY FORMAT TO THE PLOT:

```{r}
surv.obj = survfit(Surv(lifetime,status==1)~1, data=melanoma.data[melanoma.data$sex==2,])
# Plot the Kaplan-Meier estimates.
plot(surv.obj, lty=c(1,2,2), mark.time=FALSE,col=c("black","blue","blue"))
```

### b

Repeat the analysis in a) for each of the three groups of tumor thickness: 0-1 mm, 2-5 mm, and 5+ mm.

```{r}
# Take a look at the distribution of grthick (grouped tumor thickness)
table(melanoma.data[,"grthick"])

# Compute Kaplan-Meier estimates for the two groups (males, females) (without confidence intervals).
surv.obj = survfit(Surv(lifetime,status==1)~grthick, data=melanoma.data)
# Plot the Kaplan-Meier estimates.
plot(surv.obj, lty=1:3, mark.time=FALSE)
legend("bottomleft",legend=c("1","2","3"),lty=1:3)
# Log-rank test for difference between the groups:
survdiff(Surv(lifetime,status==1)~grthick, data=melanoma.data)
```

### c

Repeat the analysis in a) for ulceration. (Ulceration is present if the surface of the melanoma viewed in a microscope shows signs of ulcers and absent otherwise.)

```{r}
# Take a look at the distribution of ulcer (ulceration)
table(melanoma.data[,"ulcer"])
# Compute Kaplan-Meier estimates for the two groups (males, females) (without confidence intervals).
surv.obj = survfit(Surv(lifetime,status==1)~ulcer, data=melanoma.data)
# Plot the Kaplan-Meier estimates.
plot(surv.obj, lty=1:2, mark.time=FALSE)
legend("bottomleft",legend=c("1","2"),lty=1:2)
# Log-rank test for difference between the groups:
survdiff(Surv(lifetime,status==1)~ulcer, data=melanoma.data)


```

### d

Make a univariate Cox regression analysis of each of the covariates sex, tumor thickness and ulceration. For the numeric covariate tumor thickness you should consider whether you ought to use the original version "thickn", the log-transformed version "logthick", or the grouped version "grthick" (as a factor). Interpret the results and compare with the results in a-c.

```{r}

# Commands for sex
coxfit.sex=coxph(Surv(lifetime,status==1)~factor(sex),data=melanoma.data)
summary(coxfit.sex)
# The other covariates in question d are done by similar commands, and so is question e


```

```{r}
coxfit.sex=coxph(Surv(lifetime,status==1)~sex,data=melanoma.data)
summary(coxfit.sex)
```