\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f16 Assignment Nine}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
Except for Problem~\ref{computer}, these problems are preparation for the quiz in tutorial on Thursday November 17th, and are not to be handed in. As usual, sometimes you may be asked to prove things that are false. Please bring your printout for Problem~\ref{computer} to the quiz. Do not write anything on the printout in advance of the quiz, except possibly your name and student number.

\begin{enumerate}

\item Consider a linear regression model with $n>p$, which is always the case in practice. Since the vector of residuals $\mathbf{e} \sim N_n\left(\mathbf{0},\sigma^2(I-H)\right)$, it is tempting to write
% \break
$\frac{1}{\sigma^2}\mathbf{e}^\prime (I-H)^{-1} \mathbf{e} \sim \chi^2(n)$. Please locate support for this idea on the formula sheet. But it only works if the $n \times n$ matrix $I-H$ has an inverse. Calculate $(I-H) \, X$, and use this to show that if $(I-H)^{-1}$ exists, the columns of $X$ cannot be linearly independent.

\item This question will be a lot easier if you remember that if $X \sim \chi^2(\nu)$, then $E(X)=\nu$ and $Var(X)=2\nu$. You don't have to prove these facts; just use them. For the usual linear regression model with normal errors, $\sigma^2$ is usually estimated with $s^2 = \mathbf{e}^\prime\mathbf{e}/(n-k-1)$.
    \begin{enumerate}
    \item Show that $s^2$ is an unbiased estimator of $\sigma^2$. You did this the hard way in an earlier assignment.
It's much easier when the errors are normal.
    \item What is the distribution of $\sum_{i=1}^n \left( \frac{\epsilon_i-0}{\sigma} \right)^2$?
    \item Here is another estimate of $\sigma^2$. Define $v = \frac{1}{n} \sum_{i=1}^n \epsilon_i^2$. What is $E(v)$?
    \item Show that $Var(v) < Var(s^2)$.
    \item So it would appear that $v$ is a better estimator of $\sigma^2$ than $s^2$ is, since they are both unbiased and the variance of $v$ is lower. So why do you think $s^2$ is used in regression analysis instead of $v$?
    \end{enumerate}

\item \label{workout} In a study comparing the effectiveness of different exercise programmes, volunteers were randomly assigned to one of three exercise programmes ($A$, $B$, $C$) or put on a waiting list and told to work out on their own. Aerobic capacity is the body's ability to process oxygen. Aerobic capacity was measured before and after 6 months of participation in the programme (or 6 months of being on the waiting list). The response variable was improvement in aerobic capacity. The explanatory variables were age (a covariate) and treatment group.
    \begin{enumerate}
    \item First consider a regression model with an intercept, and no interaction between age and treatment group.
        \begin{enumerate}
        \item Make a table showing how you would set up indicator dummy variables for treatment group. Make Waiting List the reference category.
        \item Write the regression model. Please use $x$ for age, and make its regression coefficient $\beta_1$.
        \item In terms of $\beta$ values, what null hypothesis would you test to find out whether, allowing for age, the three exercise programmes differ in their effectiveness? That's the \emph{three} programmes, not including the waiting list control.
        \item Write the null hypothesis for the preceding question as $H_0: C\boldsymbol{\beta}=\mathbf{0}$. Just give the $C$ matrix.
        \item In terms of $\beta$ values, what null hypothesis would you test to find out whether Programme $B$ was better than the waiting list?
        \item In terms of $\beta$ values, what null hypothesis would you test to find out whether Programmes $A$ and $B$ differ in their effectiveness?
        \item Suppose you wanted to estimate the difference in average benefit between Programmes $A$ and $C$ for a 27-year-old participant. Give your answer in terms of $b_j$ values.
        \item Is it safe to assume that age is independent of the other explanatory variables? Answer Yes or No and briefly explain.
        \end{enumerate}
    \end{enumerate}
% \newpage

\item \label{interac} Now consider a regression model with an intercept and the interaction (actually a set of interactions) between age and treatment.
    \begin{enumerate}
    \item Write the regression model. Make it an extension of your earlier model.
    \item Suppose you wanted to know whether the slopes of the 4 regression lines were equal. In terms of $\beta$ values, what null hypothesis would you test?
    \item Suppose you wanted to know whether any differences among mean improvement in the four treatment conditions depend on the participant's age. In terms of $\beta$ values, what null hypothesis would you test?
    \item Write the null hypothesis for the preceding question as $H_0: C\boldsymbol{\beta}=\mathbf{0}$. Just give the $C$ matrix. It is $r \times p$. What is $r$? What is $p$?
    \item Suppose you wanted to know whether the difference in effectiveness between Programme $A$ and the Waiting List depends on the participant's age. In terms of $\beta$ values, what null hypothesis would you test?
    \item Suppose you wanted to \emph{estimate} the difference in average benefit between Programmes $A$ and $C$ for a 27-year-old participant. Give your answer in terms of $\widehat{\beta}$ values.
    \end{enumerate}
\newpage
\item \label{transformX} A general principle is that all valid dummy variable coding schemes are equivalent. This is because they are one-to-one linear transformations of one another. Let $A$ be a $(k+1) \times (k+1)$ nonsingular matrix.
Note that $X^*=XA$ is a one-to-one linear transformation of the explanatory variables, and
\begin{displaymath}
\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon} ~ \Leftrightarrow ~ \mathbf{y} = XA \, A^{-1}\boldsymbol{\beta} + \boldsymbol{\epsilon} = X^* \boldsymbol{\beta}^* + \boldsymbol{\epsilon}.
\end{displaymath}
This is already interesting, because it shows how transforming the explanatory variables changes the meaning of the regression coefficients. Refer to $ \mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}$ as the ``original'' model, and $ \mathbf{y} = X^* \boldsymbol{\beta}^* + \boldsymbol{\epsilon}$ as the ``transformed'' model.
    \begin{enumerate}
    \item Just to make this more concrete, suppose you have a 3-category explanatory variable and a quantitative covariate. $Y_i = \beta_0 + \beta_1d_{i,1} + \beta_2d_{i,2} + \beta_3x_{i} + \epsilon_i$, where $d_{i,1}$ and $d_{i,2}$ are indicator dummy variables for the first two groups. You want to switch to cell means coding, so that $Y_i = \beta^*_1g_{i,1} + \beta^*_2g_{i,2} + \beta^*_3g_{i,3} + \beta^*_4x_{i} + \epsilon_i$. Note that $\beta^*_4 = \beta_3$. Give the matrix $A$; you can make tables if that helps.
    \item Write down the least squares estimate $\mathbf{b}^*$ for the transformed model, and simplify. How is $\mathbf{b}^*$ related to $\mathbf{b}$? Give a formula.
    \item Compare the vector of predicted values from the two models.
    \item Compare the vector of residuals from the two models.
    \item Which is greater, $SSE$ or $SSE^*$?
    \item Suppose you want to test $H_0: C\boldsymbol{\beta} = \boldsymbol{\gamma}$. Give the equivalent null hypothesis for the transformed model. That is, what are matrices $C^*$, $\boldsymbol{\beta}^*$ and $\boldsymbol{\gamma}^*$ in $H_0: C^*\boldsymbol{\beta}^* = \boldsymbol{\gamma}^*$?
    \item Compare the $F$ statistic for $H_0: C^*\boldsymbol{\beta}^* = \boldsymbol{\gamma}^*$ to the $F$ statistic for $H_0: C\boldsymbol{\beta} = \boldsymbol{\gamma}$.
    \end{enumerate}

\item \label{onecol} Question~\ref{transformX} suggests that if a regression model with no intercept is equivalent to one with an intercept, then the residuals will add to zero. This is good to know, because it means $SST=SSR+SSE$, and $R^2$ is meaningful; so is $a$, the proportion of remaining variation. Here is an easy condition to check. Let $\mathbf{1}$ denote an $n \times 1$ column of ones. Show that if there is a $(k+1) \times 1$ vector of constants $\mathbf{v}$ with $X\mathbf{v}= \mathbf{1}$, then $\sum_{i=1}^ne_i=0$. (Another way to state this is that if there is a linear combination of the columns of $X$ that equals a column of ones, then the sum of residuals equals zero. Clearly this applies to a model with cell means coding.)

\item Based on the general linear model with normal error terms,
    \begin{enumerate}
    \item Prove the $t$ distribution given on the formula sheet for a new observation $y_0$. Use earlier material on the formula sheet. For example, how do you know numerator and denominator are independent?
    \item Derive the $(1-\alpha)\times 100\%$ prediction interval for a new observation from this population, in which the independent variable values are given in $\mathbf{x}_0$. ``Derive'' means show the High School algebra.
    \end{enumerate}

\item Suppose you have a random sample from a normal distribution, say $y_1, \ldots, y_n \stackrel{i.i.d.}{\sim} N(\mu,\sigma^2)$. If someone randomly sampled another observation from this population and asked you to guess what it was, there is no doubt you would say $\overline{y}$, and a confidence interval for $\mu$ is routine. But what if you were asked for a \emph{prediction} interval for a \emph{new} observation? Accordingly, suppose the normal model is reasonable and you observe a sample mean of $\overline{y} = 7.5$ and a sample variance (with $n-1$ in the denominator) of $s^2=3.82$. The sample size is $n=14$. Give a $95\%$ prediction interval for the next observation.
The answer is a pair of numbers. Be able to show your work. You can get the distribution result you need from the formula sheet, or you can re-derive it for this special case. Be able to do it both ways. You should use R to get the critical value, but don't bother to bring your R printout for this question.
% 305f15 Final exam has a nice version of this question with proper spacing for the answers.

\item \label{computer} Pigs are routinely given large doses of antibiotics even when they show no signs of illness, to protect their health under unsanitary conditions. Pigs were randomly assigned to one of three antibiotic drugs. Dressed weight (weight of the pig after slaughter and removal of head, intestines and skin) was the dependent variable. Independent variables are Drug type, Mother's live adult weight and Father's live adult weight. Data are in the file
\href{http://www.utstat.toronto.edu/~brunner/data/legal/pigweight.data.txt}
{\texttt{pigweight.data.txt}}. You can get a copy with
{\footnotesize
\begin{verbatim}
oink = read.table("http://www.utstat.toronto.edu/~brunner/data/legal/pigweight.data.txt")
\end{verbatim}
} % End size
    \begin{enumerate}
    \item Write the regression equation for the full model, including $\epsilon_i$.
    \item Make a table with one row for every drug, with columns showing how the dummy variables were defined. Make another column giving $E(y|\mathbf{x})$ for each drug.
    \item Predict the dressed weight of a pig getting Drug 2, whose mother weighed 140 pounds, and whose father weighed 185 pounds. Your answer is a single number.
    \item This parallel planes regression model specifies that the differences in expected weight for the different drug treatments are the same for every possible combination of mother's weight and father's weight. Give a 95\% confidence interval for the difference in expected weight between drug treatments 2 and 3. The final answer is a pair of numbers, a lower confidence limit and an upper confidence limit.
There is an easy way and a less easy way.
    \item In symbols, give the null hypotheses you would test to answer the following questions. Your answers are statements involving the $\beta$ values from your regression equation.
        \begin{enumerate}
        % There were more questions in an earlier draft.
        \item Controlling for mother's weight and father's weight, does type of drug have an effect on the expected weight of a pig?
        \item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 1 or Drug 2?
        \item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 1 or Drug 3?
        \item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 2 or Drug 3?
        \end{enumerate}
    \item For each of the questions below, give the value of the $t$ or $F$ statistic (a number from your printout), and indicate whether or not you reject the null hypothesis. The numbers may or may not be part of the default output from \texttt{summary}.
        \begin{enumerate}
        \item Controlling for mother's weight and father's weight, does type of drug have an effect on the expected weight of a pig?
        \item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 1 or Drug 2?
        \item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 1 or Drug 3?
        \item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 2 or Drug 3?
        \item Allowing for which drug they were given, does expected weight of a pig increase faster as a function of the mother's weight, or does it increase faster as a function of the father's weight?
        \end{enumerate}
    \item We can assume that farmers want their pigs to weigh a lot. In plain, non-statistical language, can you offer some advice to a farmer based on these data?
Remember, the farmer must be able to understand your answer or it is worthless.
    \end{enumerate} % End computer question

\noindent Please bring your printout to the quiz. \textbf{Your printout should show \emph{all} R input and output, and \emph{only} R input and output}. Do not write anything on your printouts except your name and student number.

\end{enumerate} % End of assignment

\vspace{50mm}

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f16}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f16}}

\end{document}

\item Now consider a regression model \emph{without} an intercept, but \emph{with} possibly unequal slopes. Make a table to show how the dummy variables could be set up, and write the regression model. Again, please use $x$ for age and make its regression coefficient $\beta_1$. This model needs to have the \emph{same number of regression coefficients as the model of Question~\ref{interac}}, so you have to think about this a little. For each treatment condition, what is the conditional expected value of $Y$? The answer is in terms of $x$ and the $\beta$ values. Please put these values as the last column of your table.