\documentclass[12pt]{article} 
%\usepackage{amsbsy} % for \boldsymbol and \pmb 
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers


\begin{document}
%\enlargethispage*{1000 pt} 

\begin{center}   
{\Large \textbf{STA 312s19 Assignment Six}}\footnote{This assignment was prepared by  \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner},
Department of Mathematical and Computational Sciences, University of Toronto. It is licensed under a 
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
     {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/312s19} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/312s19}}}
\vspace{1 mm}
\end{center}

\noindent
The paper-and-pencil part of this assignment is not to be handed in. It is practice for Quiz~6 on February 25th. The R part may be handed in as part of the quiz. \textbf{Bring a hard copy of your printout to the quiz}. Do not write anything on your printout in advance except possibly your name and student number. \vspace{5mm}

\noindent
The Kaplan-Meier estimate of the survival function is based on discrete time. Accordingly, let the survival time $T$ be a discrete random variable with non-zero probability on the points $t_1, t_2, \dots$. Also, let $t_0=0$, with $P(T=t_0) = 0$.
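
\noindent
The questions below use $S(t)$ without defining it. The following is a sketch of the notation they appear to assume, namely the usual survival function; it is not stated in the assignment itself, so check it against the lecture slides.
\begin{displaymath}
S(t) = P(T > t), \qquad S(t_0) = P(T > 0) = 1 .
\end{displaymath}
The second equality uses only the setup above: $T$ has positive probability only on $t_1, t_2, \dots$, all of which are assumed here to be greater than $t_0 = 0$.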

\begin{enumerate} 

\item Let $p_j = $ the probability of surviving past time $t_j$, given survival to time $t_{j-1}$. That is, $p_j = P(T>t_j \mid T>t_{j-1})$. Prove $p_j = \frac{S(t_j)}{S(t_{j-1})}$.

\item \label{prod} Prove $\displaystyle S(t_k) = \prod_{j=1}^k p_j$.

\item This question is background for the questions that follow. Let $X_1, \ldots, X_n$ be a random sample from a Bernoulli distribution with parameter $p$. That is, $P(X_i=1)=p$ and $P(X_i=0)=1-p$. You have already proved that the MLE of $p$ is $\widehat{p} = \bar{X}$, the sample proportion. You don't have to do it again.
    \begin{enumerate}
        \item Write down the expected value and variance of $\widehat{p}$. Derive them only if you don't know the answer.
        \item The asymptotic variance is just the variance; there is no need to go through the Fisher information in this case. So, what is the asymptotic distribution of $\widehat{p}$? It's on the formula sheet if you can translate the symbols.
    \end{enumerate}

\item In a random sample of survival times (which can happen only at points $t_1, t_2, \dots$), let  $d_j$ be the number of deaths at time $t_j$, and let $n_j$ be the number of individuals at risk at time $t_j$. ``At risk'' means not dead yet and not censored at time $t_j$. What is a reasonable estimate of $p_j$? Again, $p_j = $ the probability of surviving past time $t_j$, given survival to time $t_{j-1}$. Call the estimate $\widehat{p}_j$.

\item The estimate $\widehat{p}_j$ from the last question is clearly a sample proportion. Thinking of it as arising from a sample of Bernoullis (not quite true, but close), 
    \begin{enumerate}
        \item What should the asymptotic distribution of $\widehat{p}_j$ be? Just write down the answer.
        \item What should the asymptotic distribution of $\log\widehat{p}_j$ be? Show your work.
    \end{enumerate}
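
    \noindent
    A standard tool for a transformation such as the logarithm is the one-variable delta method, which is presumably what is intended here. A generic statement follows, in notation that is an assumption of this note ($T_n$, $\theta$, $\sigma^2$ and $g$ do not appear elsewhere in the assignment):
    \begin{displaymath}
    T_n \stackrel{\cdot}{\sim} N\!\left(\theta, \frac{\sigma^2}{n}\right)
    \quad \Longrightarrow \quad
    g(T_n) \stackrel{\cdot}{\sim} N\!\left(g(\theta), \frac{g^\prime(\theta)^2 \, \sigma^2}{n}\right),
    \end{displaymath}
    for a function $g$ that is differentiable at $\theta$ with $g^\prime(\theta) \neq 0$. Here $\stackrel{\cdot}{\sim}$ means ``is approximately distributed as'' for large $n$.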


\item Based on Problem~\ref{prod}, the natural estimator of $S(t)$ is $\displaystyle \widehat{S}(t) = \prod_{t_j \leq t} \widehat{p}_j $. 
    \begin{enumerate}
        \item Write $\log \widehat{S}(t)$ as a sum.
        \item Based on the asymptotic distribution of $\log \widehat{p}_j$, what is the (asymptotic) expected value of $\log \widehat{S}(t)$?
        \item Based on the asymptotic distribution of $\log \widehat{p}_j$ and assuming the terms are independent (almost true), what is the (asymptotic) variance of $\log \widehat{S}(t)$?
        \item Based on the idea that the sum of normals is normal, what should the asymptotic distribution of $\log \widehat{S}(t)$ be?
    \end{enumerate}

\item Assuming that your answer to the previous question is correct (you can check the lecture slides on the Kaplan-Meier estimate), derive the asymptotic distribution of $\widehat{S}(t)$. Show your work.

\item \label{se} Based on your answer to the preceding question, give a reasonable standard error for  $\widehat{S}(t)$. This should be something you could compute from sample data.

\item Here is a table that is stolen directly from a nice book on survival analysis by Hosmer and Lemeshow. The similar table on p.~26 of our text is mixed up and wrong. In the table below, notice that two observations were censored between times 2 and 4; that's why there are only 83 individuals at risk at time 4, instead of 85. Fill in the empty cells.

\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|c|r|r|c|c|} \hline
$t_j$ & $n_j$  & $d_j$ & $\widehat{p}_j$ & $\widehat{S}(t_j)$ \\ \hline
  0    &  100  &   0   & \hspace{20mm}   &  \hspace{20mm}     \\ \hline 
  2    &  100  &  15   &                 &                    \\ \hline   
  4    &   83  &   5   &                 &                    \\ \hline   
  5    &   73  &  10   &                 &     0.6894         \\ \hline   
\end{tabular}
\renewcommand{\arraystretch}{1.0}
\end{center}               

\item In the table above, how many observations were censored between times 4 and 5?

\pagebreak

\item \label{km} The file 
\href{http://www.utstat.toronto.edu/~brunner/data/legal/expo.data2.txt}
     {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/data/legal/expo.data2.txt}}
contains the data you used last week. Read the data and use R to compute the Kaplan-Meier estimate of the survival function.

    \begin{enumerate}
        \item \label{median} Give the Kaplan-Meier estimate of the median, and a 95\% confidence interval. The answer is 3 numbers on your printout.
        \item Based on the output from \texttt{summary}, give $\widehat{S}(t)$ for $t = 0.062$. The answer is a number on your printout.
        \item Give $\widehat{p}_1$, $\widehat{p}_2$, $\widehat{p}_3$ and $\widehat{p}_4$. These are numbers that you calculate from the output of \texttt{summary}. Use R as a calculator and display the numbers on your printout.
        \item Reproduce $\widehat{S}(0.062)$ from the answer to your last question. Again, use R as a calculator and display the number on your printout.
        \item Again using R as a calculator, reproduce the standard error of $\widehat{S}(0.062)$ and display the number on your printout. This applies your answer to Question~\ref{se}.
        \item Plot the Kaplan-Meier estimate of the survival function, but don't print it yet. In the next question, you are going to add another curve to the plot.
    \end{enumerate}

\item \label{mle} Last week, you did a parametric analysis assuming these data are from an exponential distribution. You may want to re-use some of your code from last week.
    \begin{enumerate}
        \item Compute the maximum likelihood estimate and a 95\% confidence interval for the median. Compare your answer (3 numbers) to Question~\ref{median}.
        \item Add the MLE of $S(t)$ to your Kaplan-Meier plot, and print it. Bring your printout to the quiz. 
    \end{enumerate}
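
    \noindent
    For orientation, here is a sketch of the quantities involved, assuming the rate parameterization $f(t) = \lambda e^{-\lambda t}$ for $t>0$. This is an assumption: last week's assignment may have used the mean $\theta = 1/\lambda$ instead, and the symbol $m$ for the median is introduced only for this note.
    \begin{displaymath}
    S(t) = e^{-\lambda t}, \qquad S(m) = \tfrac{1}{2} \iff m = \frac{\log 2}{\lambda}.
    \end{displaymath}
    Under this parameterization, the invariance principle suggests that the curve to add to the plot would be $e^{-\widehat{\lambda}\, t}$, where $\widehat{\lambda}$ is the MLE of $\lambda$.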




% \pagebreak %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\end{enumerate} % End of all the questions

\noindent
\textbf{Bring your printout for Questions~\ref{km} and~\ref{mle} to the quiz.}  All the requested numbers and the code that produced them should appear on your printout. Do not write anything on your printout in advance except possibly your name and student number. 


\end{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%