\documentclass[10pt]{article} 
%\usepackage{amsbsy} % for \boldsymbol and \pmb 
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers


\begin{document}
%\enlargethispage*{1000 pt} 

\begin{center}   
{\Large \textbf{STA 312s19 Assignment Eight}}\footnote{This assignment was prepared by  \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner},
Department of Mathematical and Computational Sciences, University of Toronto. It is licensed under a 
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
     {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/312s19} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/312s19}}}
\vspace{1 mm}
\end{center}

\noindent
The paper and pencil part of this assignment is not to be handed in. It is practice for Quiz~8 on March 11th. The R part may be handed in as part of the quiz. \textbf{Bring a hard copy of your printout to the quiz}. Do not write anything on your printout in advance except possibly your name and student number. %\vspace{5mm}

\begin{enumerate} 

\item Consider the multiplicative regression model for the failure time: $t = e^{\beta_0+\beta_1x}\times \epsilon$, where $\beta_0$ and $\beta_1$ are unknown constants (parameters), $x$ is a known, observed constant, and $\epsilon \sim \exp(1)$. 
    \begin{enumerate}
        \item Derive the probability density function of $t$. Do it directly, not the way it was done in the lecture slides.
        \item Using the formula sheet, write down the
                \begin{enumerate}
                    \item Expected value of $t$.
                    \item Median of $t$.
                    \item Survival function of $t$.
                \end{enumerate}
        \item Give the hazard function of $t$. Show some work. 
        \item If $x$ is increased by one unit, the hazard function is multiplied by \underline{\hspace{10mm}}. This is called the \emph{hazard ratio}.
    \end{enumerate}
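
\vspace{2mm}
A reminder that may help with direct derivations like the one above. It is a standard change-of-variables result, not something specific to this assignment, and the symbols $g$, $y$, $F_\epsilon$ and $f_\epsilon$ are notation introduced only for this note. If $t = g(\epsilon)$, where $g$ is strictly increasing and differentiable, then for $y$ in the range of $g$,
\begin{displaymath}
F_t(y) = P\{g(\epsilon) \leq y\} = F_\epsilon\left(g^{-1}(y)\right)
\quad \mbox{ and so } \quad
f_t(y) = f_\epsilon\left(g^{-1}(y)\right) \, \frac{d}{dy} \, g^{-1}(y).
\end{displaymath}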

\item Let $\epsilon \sim \exp(1)$.  
    \begin{enumerate}
        \item Derive the density of $W = \epsilon^\sigma$, where $\sigma>0$.
        \item That density is Weibull. What are the parameters $\alpha$ and $\lambda$?
    \end{enumerate}

\item Let $W$ have a Weibull distribution with parameters $\alpha$ and $\lambda$, and let $T = c \, W$, where $c$ is a positive constant. Show that the distribution of $T$ is Weibull, and give the parameters.

\item \label{wmodel} Let the failure time 
$t_i = \exp\{\beta_0 + \beta_1x_{i,1} + \cdots + \beta_{p-1}x_{i,p-1} \} \cdot \epsilon_i^\sigma$, where again, $\epsilon_i \sim \exp(1)$.
    \begin{enumerate}
        \item Based on your answers to the preceding questions, what is the distribution of $t_i$? Just write down the answer.
        \item Show that if $x_{i,k}$ is increased by $c$ units, $t_i$ is multiplied by $e^{c\beta_k}$.
    \end{enumerate}

% \pagebreak

\item For the model of Question~\ref{wmodel}, give the following. Don't do more calculation than you have to.
    \begin{enumerate}
        \item Expected value of $t_i$.
        \item Median of $t_i$.
        \item Survival function of $t_i$.
        \item Hazard function $h(t)$.
        \item If $x_{i,k}$ is increased by one unit, the hazard function is multiplied by \underline{\hspace{10mm}}. 
    \end{enumerate}
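
\vspace{2mm}
For reference, the quantities in this question are tied together by standard identities. They are stated here in generic notation that is an assumption of this note, not a copy of the formula sheet. For a continuous failure time with density $f(t)$ and cumulative distribution function $F(t)$,
\begin{displaymath}
S(t) = 1-F(t)
\quad \mbox{ and } \quad
h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt} \log S(t).
\end{displaymath}
So once the survival function is available, the hazard function takes only one differentiation.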

% \pagebreak

\item Show that the Weibull model of Question~\ref{wmodel} has proportional hazards. That is, consider the hazard functions of two individuals with different $\mathbf{x}$ vectors. Show that the ratio of their hazard functions does not depend on $t$. This means that the two hazard functions are always in the same proportion at every point in time.

\item A sample of lung cancer patients is classified according to type of cancer: squamous, small cell, adenocarcinoma, and large cell. We also have age and a physician's rating of how far the disease has progressed on a scale from 1 to 10, which we will call ``severity.'' Small cell lung cancer is found exclusively in smokers, ex-smokers, and people who have worked in the asbestos industry.

    \begin{enumerate} 
        \item \label{wreg} Write a (multiplicative) Weibull regression equation, denoting the length of time between diagnosis and death (call it survival time) for patient $i$ by $t_i$. Denote age by $x_{i,1}$ and disease severity by $x_{i,2}$. There should be \emph{no interactions} in the model, in case you know what those are. You do not need to say how the dummy variables are defined. You will do that in the next part. Complete the equation below.
        
\vspace{3mm}

$t_i = $

\vspace{3mm}

        \item In the table below, make columns showing how your dummy variables are defined. Make small cell the reference category. In the last column, write the expected survival time, using the notation of the model of Question~\ref{wmodel}. If \emph{symbols} for your dummy variables appear in the last column, the answer is wrong. \vspace{4mm}

\hspace{3.6in} Expected Survival Time
\begin{center}
\renewcommand{\arraystretch}{2.5}
\begin{tabular}{|l|c|c|}  \hline
Squamous & \hspace{50mm} & \hspace{70mm} \\ \hline
Small Cell   &     &  \\ \hline
Adeno        &     &  \\ \hline
Large Cell   &     &  \\ \hline
\end{tabular}
\renewcommand{\arraystretch}{1.0}
\end{center} \vspace{10mm}

        \item In the notation of your model, what is the expected survival time for a 45-year-old patient with adenocarcinoma and a disease severity of 6? 

% \pagebreak %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

        \item You want to produce a large-sample confidence interval for expected survival time, for a 50-year-old patient with adenocarcinoma and a disease severity of 2. You need to use the delta method. 
                \begin{enumerate}
                    \item What is the parameter vector $\boldsymbol{\theta}$? Give a general answer for your model. 
                    \item What is $\dot{g}(\boldsymbol{\theta})$ for this particular example?
                \end{enumerate}
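
\vspace{2mm}
For reference, one common statement of the multivariate delta method is sketched below. The symbols $\widehat{\boldsymbol{\theta}}$, $\widehat{\mathbf{V}}$ and $k$ are notation assumed for this note, so check them against the lecture slides. Suppose $\widehat{\boldsymbol{\theta}}$ is approximately multivariate normal with mean $\boldsymbol{\theta}$ (a $k \times 1$ vector) and estimated asymptotic covariance matrix $\widehat{\mathbf{V}}$, and $g$ is a differentiable real-valued function. Then $g(\widehat{\boldsymbol{\theta}})$ is approximately normal with mean $g(\boldsymbol{\theta})$ and variance $\dot{g}(\boldsymbol{\theta}) \, \widehat{\mathbf{V}} \, \dot{g}(\boldsymbol{\theta})^\top$, where
\begin{displaymath}
\dot{g}(\boldsymbol{\theta}) = \left( \frac{\partial g}{\partial\theta_1}, \ldots, \frac{\partial g}{\partial\theta_k} \right)
\end{displaymath}
is a $1 \times k$ row vector. An approximate 95\% confidence interval for $g(\boldsymbol{\theta})$ is then
\begin{displaymath}
g(\widehat{\boldsymbol{\theta}}) \pm 1.96 \, \sqrt{ \dot{g}(\widehat{\boldsymbol{\theta}}) \, \widehat{\mathbf{V}} \, \dot{g}(\widehat{\boldsymbol{\theta}})^\top }.
\end{displaymath}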

        \item For a patient with large cell lung cancer, expected survival time is \underline{\hspace{20mm}} times as great as the expected survival time for a patient with small cell lung cancer. Answer in terms of the Greek letters from your model. Do age and disease severity affect the answer (in this model)?


        \item For a 47-year-old patient with squamous lung cancer and a disease severity of 3, the median survival time is \underline{\hspace{20mm}} times as great as the median survival time for a 47-year-old with adenocarcinoma and a disease severity of 3. Answer in terms of the Greek letters from your model. 


        \item You want to know whether, controlling for age and disease severity, type of lung cancer has any effect on average survival time. What is the null hypothesis? Answer in terms of the Greek letters from your model.

\pagebreak %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

        \item That last question could be answered with either a large-sample likelihood ratio test, or a Wald test.
                \begin{enumerate}
                    \item Suppose you decided on a likelihood ratio test. Write the multiplicative Weibull regression equation for the restricted model. \vspace{3mm}
                    
$t_i =$
 \vspace{3mm}
                    \item Suppose you decided on a Wald test. Write the null hypothesis $H_0: \mathbf{L}\boldsymbol{\theta} = \mathbf{0}$ in matrix form.
                \end{enumerate}
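
\vspace{2mm}
For reference, the two test statistics can be written in the following general form. The notation is assumed for this note and may differ from the lecture slides: $\widehat{\boldsymbol{\theta}}$ is the unrestricted maximum likelihood estimate, $\widehat{\boldsymbol{\theta}}_0$ is the estimate under the restricted model, $\widehat{\mathbf{V}}$ is the estimated asymptotic covariance matrix of $\widehat{\boldsymbol{\theta}}$, $\ell$ is the log likelihood, and $\mathbf{L}$ has $r$ linearly independent rows.
\begin{displaymath}
W = (\mathbf{L}\widehat{\boldsymbol{\theta}})^\top \left( \mathbf{L} \widehat{\mathbf{V}} \mathbf{L}^\top \right)^{-1} \mathbf{L}\widehat{\boldsymbol{\theta}}
\quad \mbox{ and } \quad
G^2 = -2\,\ell(\widehat{\boldsymbol{\theta}}_0) - \left[ -2\,\ell(\widehat{\boldsymbol{\theta}}) \right].
\end{displaymath}
Under $H_0: \mathbf{L}\boldsymbol{\theta} = \mathbf{0}$, both statistics have approximately a chi-squared distribution with $r$ degrees of freedom in large samples.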

        \item You want to know whether, allowing for type of cancer and disease severity, the patient's age has any connection to life expectancy. What is the null hypothesis? Answer in terms of the Greek letters from your model. 
        
        \item You want to know whether, controlling for age and disease severity, median survival time is different for patients with large cell or small cell cancer. What is the null hypothesis? Answer in terms of the Greek letters from your model. 

        \item  You want to know whether, controlling for age and disease severity, median survival time is different for patients with squamous lung cancer or adenocarcinoma. What is the null hypothesis? Answer in terms of the Greek letters from your model.
    \end{enumerate}

\item \label{computer} The \texttt{survival} package has a built-in data set on patients with advanced lung cancer. Type \texttt{help(cancer)} for details.

    \begin{enumerate}
    
        \item An important question is whether self-ratings by the patients are useful predictors of survival. Ignoring all other variables for the present, fit a model with just one explanatory variable: Karnofsky performance score as rated by patient. Give the value of the test statistic, and also the $p$-value. Do you reject the null hypothesis at the $\alpha=0.05$ significance level? State the conclusion in plain, non-statistical language.

        \item Now fit a Weibull regression model with all the available explanatory variables, excluding institution. Controlling for all other explanatory variables, is the Karnofsky performance score as rated by patient related to survival time? Give the value of the test statistic, and also the $p$-value. Do you reject the null hypothesis at the $\alpha=0.05$ significance level? State the conclusion in plain, non-statistical language.
        
        \item You will now observe something that has led to many incorrect conclusions based on likelihood ratio tests. Without dropping any variables at this time, test the diet and weight loss variables in a single test, controlling for all the other explanatory variables. Do it two ways, with a likelihood ratio test and a Wald test. Guided by the $\alpha=0.05$ significance level, what do you conclude in each case? Do the results agree? Which one is more compatible with the results of the $Z$-tests?
        
        \item First, note that the default behaviour of \texttt{survreg} is to omit cases with any missing values. Now look at the output of \texttt{summary(cancer)}. Do you see the missing values for \texttt{meal.cal} and \texttt{wt.loss}? Recalling that the likelihood ratio test statistic $G^2$ is the difference between two $-2$ log likelihoods and that the $-2$ log likelihood measures badness of model fit, explain how the $-2$ log likelihood is affected by missing values in the variables that are omitted from a restricted model.
        
        \item The solution is to base both full and restricted models upon a data set that has no missing values for the full model. Create a data frame with this property. Yes, specifically a data \emph{frame}. See \texttt{help(na.omit)}. Please note that you do not want to omit the one case that has institution missing, but no other missing data. My data frame has 228 rows and 9 columns.
        
        \item Based on this new data frame with no missing values, fit the full and restricted models, and test the difference between them with a likelihood ratio test. My test statistic value is $G^2 = 3.279398$. Are these results closer to the Wald test? 
        
        \item Now drop \texttt{age}, \texttt{pat.karno}, \texttt{meal.cal} and \texttt{wt.loss}, obtaining a smaller model that we hope is cleaner and better for prediction. Looking at the \texttt{summary} output for this model, one is forced to wonder whether it's necessary to make busy doctors fill out two questionnaires instead of just one, especially since some of the questions are likely very similar. Based on this consideration, drop another variable. We now have a model with just two explanatory variables. Fit that model and look at the \texttt{summary} output.
                \begin{enumerate}
                    \item Controlling for physician's rating of how poorly the patient is doing, is median survival time different for male and female patients? Give the null hypothesis in symbols, the value of the test statistic and the $p$-value (numbers), and state the conclusion in plain, non-statistical language.  
                    \item Allowing for patient's gender, is physician's rating informative about survival time?  Give the null hypothesis in symbols, the value of the test statistic and the $p$-value (numbers), and state the conclusion in plain, non-statistical language.
                    \item The default output of \texttt{summary} includes a test that will tell you whether the estimated hazard function is increasing or decreasing, without actually plotting it. Do the necessary paper-and-pencil proof. Then, state the null hypothesis in symbols, give the value of the test statistic and the $p$-value (numbers from the printout), and state the conclusion in plain, non-statistical language.
                    \item Estimate the median survival time for female patients with an \texttt{ecog} rating of 1. Include a 95\% confidence interval.
                    \item This analysis could continue. Look at \texttt{table(ph.ecog)}. What is the next thing you would do with the data?
                \end{enumerate}
        
    \end{enumerate} % End computer questions

\end{enumerate} % End of all the questions

    \noindent
    Please bring your printout to the quiz. \textbf{Your printout should show \emph{all} R input and output, 
    and \emph{only} R input and output}. Do not write anything on your printouts except your name and student 
    number. 
% \pagebreak %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\end{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%