\documentclass[10pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442 Assignment Nine}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
The non-computer questions are just practice for the quiz, and are not to be handed in. Use R for Questions~\ref{sales} and~\ref{product}, and bring your printout to the quiz. \textbf{Your printout should show \emph{all} R input and output, and \emph{only} R input and output}. Do not write anything on your printouts except your name and student number.

\vspace{1mm}

% More permutation testing? Maybe an ancova. Not pigs: That's the final
% Bootstrap Quiz 5 has SAT. Try H0:beta1*beta2=0

\begin{enumerate}

\item \label{sales} Telephone sales representatives use computer software to help them locate potential customers, answer questions, take credit card information and place orders. Twelve sales representatives were randomly assigned to each of three new software packages the company was thinking of purchasing. The data for each sales representative include the software package (1, 2 or 3), sales last quarter with the old software, and sales this quarter with one of the new software packages. Sales are in number of units sold. The data are in
\href{http://www.utstat.toronto.edu/~brunner/data/legal/sales.data.txt}
{\texttt{sales.data.txt}}.
Get the data with
{\small
\begin{verbatim}
sales = read.table("http://www.utstat.toronto.edu/~brunner/data/legal/sales.data.txt", header=TRUE)
\end{verbatim}
} % End size
The explanatory and response variables are what you would think.
\begin{enumerate}
    \item Fit a full model in which the slopes and intercepts of the regression lines relating sales last quarter to sales this quarter might depend on the kind of software the sales representatives are using.
    \item Carry out an ordinary $F$-test to determine whether the effect of software type on sales depends on the representative's performance last quarter. Be able to state your conclusion in plain, non-statistical language.
    \item Estimate the slopes of the three regression lines. Make sure these numbers are on your printout. I don't see how you can do this without making a table.
    \item Carry out tests to answer these questions. If they are already on the output of \texttt{summary}, use that.
        \begin{enumerate}
            \item Are the slopes for Software 1 and 2 different?
            \item Are the slopes for Software 1 and 3 different?
            \item Are the slopes for Software 2 and 3 different?
        \end{enumerate}
    Protecting the three tests with a Bonferroni correction at the joint 0.05 significance level, what do you conclude? Plain language is not necessary, but you should say what happened.
    \item \label{diffatmean} The average (sample mean) performance last quarter was 76.56 (please use exactly this number). We are interested in whether the three software packages differ in their effectiveness for sales representatives with average performance last quarter.
        \begin{enumerate}
            \item Estimate expected performance this quarter for sales representatives with average performance last quarter. These three numbers should appear on your printout.
            \item State the null hypothesis in symbols.
            \item Carry out the $F$-test. % p = 0.5488
            \item In plain language, what do you conclude?
        \end{enumerate}
    \item Now we will try a randomization test. Sales last quarter and sales this quarter are what they are, and furthermore the pairs stay together, to preserve the strong relationship between the covariate and response variable.
    We'll randomly shuffle the \emph{pairs} against the fixed software variable, and carry out a randomization test of the hypothesis in Question~\ref{diffatmean}; that is, to find out whether the three software packages differ in their effectiveness for sales representatives with sales of 76.56 last quarter. Use the $F$-statistic as your test statistic. Your final answer is a randomization $p$-value. Mine was very close to what I got from the classical $F$-test in Question~\ref{diffatmean}.
\end{enumerate}

\pagebreak

\item \label{product} As a student recently observed, we can easily test the null hypothesis that $\beta_1=0$ and $\beta_2=0$, but what about the null hypothesis that $\beta_1=0$ \emph{or} $\beta_2=0$?\footnote{Or both.} This is quite practical, because the alternative is that both parameters are non-zero. The trouble is that $H_0: \beta_1\beta_2=0$ is not a linear null hypothesis, and the general linear $F$ test only applies to collections of linear restrictions on the $\beta$ values.

Why don't we bootstrap $T=\widehat{\beta}_1\widehat{\beta}_2$, and if the 95\% quantile confidence interval does not include zero, we'll reject $H_0: \beta_1\beta_2=0$ at the 0.05 level. Use the
\href{http://www.utstat.toronto.edu/~brunner/data/legal/openSAT.data.txt}{SAT data}
again. Get the data with
{\small
\begin{verbatim}
sat = read.table("http://www.utstat.toronto.edu/~brunner/data/legal/openSAT.data.txt")
\end{verbatim}
} % End size
Your objective is to produce two numbers, the lower confidence limit and the upper confidence limit. Do you reject the null hypothesis? What do you conclude?

\item Arsenic is a powerful poison, which is why it has been used on farms for many years to kill insects. Even in very small amounts, arsenic can cause cancer in humans, and recently it has been found that rice and foods made from rice tend to be very high in arsenic. Brown rice is worse, by the way.
In a controlled experiment, pots of rice were prepared by either washing the rice first or not, and by cooking the rice in either a low, a medium or a high amount of water. The response variable is amount of arsenic in the cooked rice.
\begin{enumerate}
    \item Use a regression model with \emph{cell means coding}. That's the model with no intercept, and one indicator dummy variable for each treatment combination. You don't have to say how the dummy variables are defined. That will become clear in the next part. Just give the regression equation.
    \item Write the expected amounts of arsenic in the table below, in terms of the $\beta$ parameters of your model.
    \begin{center}
    \begin{tabular}{|l|c|c|c|} \hline
             & \multicolumn{3}{|c|}{Amount of Water} \\ \hline
             & Low & Medium & High \\ \hline
    Washed   & ~~~~~~~~~~ & ~~~~~~~~~~ & ~~~~~~~~~~ \\ \hline
    Unwashed & ~~~~~~~~~~ & ~~~~~~~~~~ & ~~~~~~~~~~ \\ \hline
    \end{tabular}
    \end{center}
    \item If you wanted to test whether the effect of washing the rice depended on how much water you cook it in, what is the null hypothesis? Give your answer in terms of the $\beta$ values in your model.
    \item If you wanted to test whether washing the rice before cooking has any effect if the rice is cooked in a lot of water, what is the null hypothesis? Give your answer in terms of $\beta$ values.
    \item Suppose you want to test whether the amount of water used to cook the rice makes any difference if the rice has been washed. What is the null hypothesis? Give your answer in terms of $\beta$ values.
    \item Averaging across different amounts of water used to cook the rice, does pre-washing affect the amount of arsenic in the rice? What null hypothesis would you test to answer this question? Give your answer in terms of $\beta$ values.
    \item If you wanted to test whether the effect of the amount of water used to cook the rice depends on whether you wash it first, what is the null hypothesis? Give your answer in terms of $\beta$ values.
\end{enumerate}
% Specifying that all the sample sizes are equal and asking for a non-centrality parameter makes a nice last part to this question, but it's too time-consuming for the final.

\item Consider a two-factor analysis of variance in which each factor has two levels. Use this regression model for the problem:
\begin{displaymath}
Y_i = \beta_0 + \beta_1 d_{i,1} + \beta_2 d_{i,2} + \beta_3 d_{i,1}d_{i,2} + \epsilon_i,
\end{displaymath}
where $d_{i,1}$ and $d_{i,2}$ are dummy variables.
%\pagebreak
\begin{enumerate}
    \item Make a two-by-two table showing the four treatment means in terms of $\beta$ values. Use \emph{effect coding}. In terms of the $\beta$ values, state the null hypothesis you would use to test for
        \begin{enumerate}
            \item Main effect of the first factor
            \item Main effect of the second factor
            \item Interaction
        \end{enumerate}
    \item Make a two-by-two table showing the four treatment means in terms of $\beta$ values. Use \emph{indicator dummy variables} (zeros and ones). In terms of the $\beta$ values, state the null hypothesis you would use to test for
        \begin{enumerate}
            \item Main effect of the first factor
            \item Main effect of the second factor
            \item Interaction
        \end{enumerate}
    \item Which dummy variable scheme do you like more?
\end{enumerate}

\item In a study of math education in elementary school, equal numbers of boys and girls were randomly assigned to one of three training programmes designed to improve spatial reasoning. After five school days of training, the students were given a standardized test of spatial reasoning. Score on the spatial reasoning test is the response variable. You will define a regression model for this factorial analysis of variance. Don't write the model yet.
\begin{enumerate}
    \item In the table below, show how your dummy variables are defined. \emph{Use effect coding.} That's the scheme with an intercept and minus ones. Write the name of each dummy variable at the head of its column.
    \begin{center}
    \begin{tabular}{|l|c|} \hline
    Girls, Programme 1 & ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \\ \hline
    Girls, Programme 2 & \\ \hline
    Girls, Programme 3 & \\ \hline
    Boys, Programme 1 & \\ \hline
    Boys, Programme 2 & \\ \hline
    Boys, Programme 3 & \\ \hline
    \end{tabular}
    \end{center}
    \item Give $E[Y_i|\mathbf{X}_i=\mathbf{x}_i]$ for the full model. Include the interaction terms. Notice you are \emph{not} being asked to write expected values in the table. They are too messy.
    \item Suppose you want to test whether, averaging across training programmes, there is a difference between girls and boys in their average performance on the spatial reasoning test. State the null hypothesis in terms of the $\beta$ values in your model.
    \item Suppose you want to test whether, averaging across boys and girls, there is a difference between training programmes in average performance on the spatial reasoning test. State the null hypothesis in terms of the $\beta$ values in your model.
    \item Suppose you want to test whether the sex difference in average performance depends on which training programme the children are in. State the null hypothesis in terms of the $\beta$ values in your model.
\end{enumerate}

\end{enumerate}

Please bring your printouts for Questions~\ref{sales} and~\ref{product} to the quiz. \textbf{Your printouts should show \emph{all} R input and output, and \emph{only} R input and output}. Do not write anything on your printouts except your name and student number.

% \vspace{50mm}

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely.
The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf16} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf16}} \end{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%