\documentclass[11pt]{article} 
%\usepackage{amsbsy} % for \boldsymbol and \pmb 
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage{alltt} % For colours within a verbatim-like environment
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers


\begin{document}
%\enlargethispage*{1000 pt} 

\begin{center}   
{\Large \textbf{STA 312f22 Assignment Nine}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
Please bring your R printout for Question \ref{titanic} to the quiz. The non-computer questions are practice for the quiz on Friday Nov. 25th, and are not to be handed in.  


\begin{enumerate}

\item Arsenic is a powerful poison, which is why it has been used on farms for many years to kill insects. Even in very small amounts, arsenic can cause cancer in humans, and recently it has been found that rice and foods made from rice (especially rice grown in the United States) tend to be very high in arsenic. Brown rice is worse, by the way.

In a controlled experiment, pots of rice were prepared by either washing the rice first or not, and by cooking the rice in either a low, a medium or a high amount of water. The response variable is the amount of arsenic in the cooked rice. It's a continuous variable, and this is normal theory regression.

        \begin{enumerate}
            \item Use a regression model with \emph{cell means coding}. That's the model with no intercept, and one indicator dummy variable for each treatment combination. You don't have to say how the dummy variables are defined. That will become clear in the next part. Just give the regression equation. 
            \item  Write the expected amounts of arsenic in the table below, in terms of the $\beta_j$ parameters of your model.
\begin{center}
\begin{tabular}{|l|c|c|c|}  \hline
 & \multicolumn{3}{|c|}{Amount of Water} \\ \hline
          & Low  & Medium  &  High \\ \hline
Washed    & ~~~~~~~~~~  &  ~~~~~~~~~~   &  ~~~~~~~~~~  \\ \hline
Unwashed  & ~~~~~~~~~~  &  ~~~~~~~~~~   &  ~~~~~~~~~~  \\ \hline
\end{tabular}\end{center}

            \item  If you wanted to test whether the effect of washing the rice depended on how much water you cook it in, what is the null hypothesis? Give your answer in terms of the $\beta_j$ values in your model. 

            \item  If you wanted to test whether washing the rice before cooking has any effect if the rice is cooked in a lot of water, what is the null hypothesis? Give your answer in terms of $\beta_j$ values. 

            \item Suppose you want to test whether the amount of water used to cook the rice makes any difference if the rice has been washed. What is the null hypothesis? Give your answer in terms of $\beta_j$ values. 

            \item Averaging across different amounts of water used to cook the rice, does pre-washing affect the amount of arsenic in the rice? What null hypothesis would you test to answer this question? Give your answer in terms of $\beta_j$ values. 

            \item If you wanted to test whether the effect of the amount of water used to cook the rice depends on whether you wash it first, what is the null hypothesis? Give your answer in terms of $\beta_j$ values. 
        \end{enumerate}
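
\vspace{2mm}
\noindent
As a reminder of the notation, here is a small sketch for a \emph{hypothetical} one-factor design with three treatments. It is an illustration only, not the model for the rice experiment. With cell means coding,
\begin{displaymath}
    Y_i = \beta_1 d_{i,1} + \beta_2 d_{i,2} + \beta_3 d_{i,3} + \epsilon_i,
\end{displaymath}
where $d_{i,j}=1$ if observation $i$ receives treatment $j$ and $d_{i,j}=0$ otherwise. Then $E(Y_i)=\beta_j$ for an observation in treatment $j$, so each regression coefficient is a treatment mean. A null hypothesis such as ``treatments one and two have the same expected response'' is simply $H_0: \beta_1=\beta_2$.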
% Specifying that all the sample sizes are equal and asking for a non-centrality parameter makes a nice last part to this question, but it's too time-consuming for the final and anyway I didn't do power in 2017.

\pagebreak 

\item Consider a two-factor analysis of variance in which each factor has two levels. Use this regression model for the problem: 
\begin{displaymath}
    Y_i = \beta_0 + \beta_1 d_{i,1} + \beta_2 d_{i,2} + \beta_3 d_{i,1}d_{i,2} + \epsilon_i,
\end{displaymath}
where $d_{i,1}$ and $d_{i,2}$ are dummy variables. 

%\pagebreak

    \begin{enumerate}
        \item Make a two-by-two table showing the four treatment means in terms of $\beta$ values. Use \emph{effect coding}. In terms of the $\beta$ values, state the null hypothesis you would use to test for 
            \begin{enumerate}
                \item Main effect of the first factor
                \item Main effect of the second factor
                \item Interaction
             \end{enumerate}
        \item Make a two-by-two table showing the four treatment means in terms of $\beta$ values. Use \emph{indicator dummy variables} (zeros and ones). In terms of the $\beta$ values, state the null hypothesis you would use to test for 
            \begin{enumerate}
                \item Main effect of the first factor
                \item Main effect of the second factor
                \item Interaction
             \end{enumerate}
        \item Which dummy variable scheme do you like more?
    \end{enumerate}
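
\vspace{2mm}
\noindent
To illustrate how a cell mean is obtained, here is a sketch under one assumed convention. The choice of which level gets which value is an assumption, and yours may differ. In every case, substitute the dummy variable values for the cell into
\begin{displaymath}
    E(Y_i) = \beta_0 + \beta_1 d_{i,1} + \beta_2 d_{i,2} + \beta_3 d_{i,1}d_{i,2}.
\end{displaymath}
Suppose that with effect coding, $d_{i,1}=1$ for the first level of the first factor and $d_{i,1}=-1$ for the second level, and similarly for $d_{i,2}$. For the cell with the first factor at level one and the second factor at level two, $d_{i,1}=1$ and $d_{i,2}=-1$, so that
\begin{displaymath}
    E(Y_i) = \beta_0 + \beta_1 - \beta_2 - \beta_3.
\end{displaymath}
With indicator dummy variables equal to one for the first level and zero for the second level, the same cell has $d_{i,1}=1$ and $d_{i,2}=0$, so that $E(Y_i) = \beta_0 + \beta_1$. The other cells follow in the same way.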


\item \label{titanic} This question uses the built-in R table \texttt{Titanic}. You may want to get the data in shape for logistic regression using the brutal approach illustrated in lecture (Logistic regression on the Berkeley data). There are other ways, but at least this one works and you have an example. Start by constructing a data frame with only the adults. My data frame has 2,092 rows.
    \begin{enumerate}
        \item The first analysis uses just adult males, because the main goal of this analysis is to compare the survival of passengers to crew, and most of the crew were men. There's a variable Class, which is 1st, 2nd, 3rd and Crew. You'll test the relation of this variable to survival, and then compare the survival of Crew to the individual passenger classes.
            \begin{enumerate}
                \item First just as a check, test the relationship of Class to survival with a contingency table. Calculate the likelihood ratio test statistic and the $p$-value. I used the \texttt{table} and \texttt{chisq.test} functions. Also, please use \texttt{prop.table} on your contingency table, so you can see what actually happened!
                \item Now do the likelihood ratio test with logistic regression and dummy variables. \emph{Make crew the reference category}.
                \item What do you conclude from the tests for $\beta_1$, $\beta_2$, and $\beta_3$? Use plain, non-statistical language.
                \item The summary output also includes a test of $H_0: \beta_0=0$. What does this null hypothesis mean in terms of survival? In plain, non-statistical language, what do you conclude from the test?
             \end{enumerate} % End analysis of adult males.
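
\vspace{2mm}
\noindent
For reference, here is a sketch of the kind of model involved. The symbols and the ordering of the dummy variables are assumptions for illustration, and may not match your R output. Let $\pi$ denote the probability of survival, and let $d_1$, $d_2$ and $d_3$ be indicators for 1st, 2nd and 3rd class respectively, so that Crew is the reference category. Then
\begin{displaymath}
    \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 d_1 + \beta_2 d_2 + \beta_3 d_3.
\end{displaymath}
Under this setup, $e^{\beta_j}$ for $j=1,2,3$ is the odds of survival in the corresponding passenger class divided by the odds of survival for Crew, so $\beta_j=0$ says the two odds are equal.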

\newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

        \item Now we are going to look for a possible Sex by Class interaction, just for the adult passengers. 
            \begin{enumerate}
                \item \label{look} The first job is to take a look at the data and see what happened. This should be the first step in any data analysis. Making use of the fact that sample proportions are just sample means computed on the 0-1 outcome $y$, use \texttt{tapply} the way I did on the rotten potato data, and get a Sex by Class table of sample proportions. Each number in the table is the proportion of passengers who survived.
                \item Using effect coding (that's the dummy variable setup with 0, 1 and $-1$), test for the Sex by Class interaction with a likelihood ratio test and a Wald test. What is the critical value? (Check the formula sheet.) I get $G^2 = 64.074$.
                \item The test for interaction definitely indicates that the odds ratios are unequal for the three passenger classes. So, it's important to look at the estimated odds ratios. I think the best way to express them is odds of survival for women divided by odds of survival for men. There are several good ways to do this. It does not matter how you get the job done, but those three numbers should appear on your printout. My answer for first class is 72.45614. 
                \item Now use Wald tests to carry out all three pairwise comparisons of the odds ratios. The easiest way is to use cell means coding. That's the model with no intercept and an indicator dummy variable for each treatment combination. I didn't actually make the dummy variables myself. I constructed a combination variable using \texttt{paste}, like the variable \texttt{TB} in the rotten potato lecture. In plain, non-statistical language, what do you conclude from the pairwise comparisons? As usual, be guided by the 0.05 significance level.
                \item When you fit a model with just your combination variable, you get some $z$-tests. They are meaningful. Be able to say what each one means, in plain, non-statistical language.
                \item Just to verify that you know what's going on, transform the $\widehat{\beta}_j$ values into estimated survival probabilities with one line of code. Compare your answer to Question~\ref{look}.
             \end{enumerate} % End sex by class interaction sub-question.
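
\vspace{2mm}
\noindent
The following sketch may help connect the last few parts. The subscripts are generic placeholders, not the labels R will print. With cell means coding in a logistic regression, the log odds of survival for treatment combination $j$ is just $\beta_j$, so the survival probability for that combination is
\begin{displaymath}
    \pi_j = \frac{e^{\beta_j}}{1+e^{\beta_j}}.
\end{displaymath}
The odds ratio comparing combination $j$ to combination $k$ is $e^{\beta_j-\beta_k}$. It follows that two odds ratios are equal, say $e^{\beta_j-\beta_k} = e^{\beta_\ell-\beta_m}$, if and only if
\begin{displaymath}
    \beta_j - \beta_k = \beta_\ell - \beta_m,
\end{displaymath}
which is a linear null hypothesis of the kind a Wald test can handle.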


    \end{enumerate} % End of the Titanic question

\end{enumerate} % End of all the questions

\vspace{30mm}

\begin{center}
\textbf{Please bring hard copy of your full R input and output to the quiz. Some of it may be handed in.}
\end{center}

%\newpage
\noindent
\begin{center}\begin{tabular}{l}
\hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by  \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner},
Department of Statistics, University of Toronto. It is licensed under a 
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
     {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/312f22}{\texttt{http://www.utstat.toronto.edu/\textasciitilde brunner/oldclass/312f22}}

\end{document}