\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}

\begin{center}
{\Large \textbf{STA 302f17 Assignment Nine}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
Except for Problem~\ref{pigweight}, these problems are preparation for the quiz in tutorial on Thursday November 23rd, and are not to be handed in. Please bring your printouts for Problem~\ref{pigweight} to the quiz. Do not write anything on the printouts in advance of the quiz, except possibly your name and student number.

\begin{enumerate}

\item Suppose you fit (estimate the parameters of) a regression model, obtaining $\mathbf{b}$, $\widehat{\mathbf{y}}$ and $\mathbf{e}$. Call this Model One.
\begin{enumerate}
\item In an attempt to squeeze some more information out of the data, you fit a second regression model, using $\mathbf{e}$ from Model One as the dependent variable, and exactly the same $X$ matrix as Model One. Call this Model Two.
\begin{enumerate}
\item What is $\mathbf{b}$ for Model Two? Show your work and simplify.
\item What is $\widehat{\mathbf{y}}$ for Model Two? Show your work and simplify.
\item What is $\mathbf{e}$ for Model Two? Show your work and simplify.
\item What is $s^2$ for Model Two?
\end{enumerate}
\item Now you fit a \emph{third} regression model, this time using $\widehat{\mathbf{y}}$ from Model One as the dependent variable, and again, exactly the same $X$ matrix as Model One. Call this Model Three.
\begin{enumerate}
\item What is $\mathbf{b}$ for Model Three? Show your work and simplify.
\item What is $\widehat{\mathbf{y}}$ for Model Three? Show your work and simplify.
\item What is $\mathbf{e}$ for Model Three? Show your work and simplify.
\item What is $s^2$ for Model Three?
\end{enumerate}
\end{enumerate}
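If you want to verify your answers empirically, here is a minimal R sketch. The data are simulated, so the variable names are made up for illustration and are not part of the problem.
{\footnotesize
\begin{verbatim}
# Simulated illustration of Models One, Two and Three. Nothing here is
# assignment data; x1, x2 and y are invented.
set.seed(302)
x1 = rnorm(50); x2 = rnorm(50)
y = 1 + 2*x1 - x2 + rnorm(50)
model1 = lm(y ~ x1 + x2)                 # Model One
model2 = lm(resid(model1) ~ x1 + x2)     # Model Two: e as dependent variable
model3 = lm(fitted(model1) ~ x1 + x2)    # Model Three: y-hat as dependent variable
round(coef(model2), 10)                  # Compare to your answer for b in Model Two
coef(model3); coef(model1)               # Compare these two sets of coefficients
\end{verbatim}
} % End size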
\item Data for a regression study are collected at two different locations; $n_1$ observations are collected at location one, and $n_2$ observations are collected at location two. The same independent variables are used at each location. We need to know whether the error variance $\sigma^2$ is the same at the two locations, possibly because we are concerned about data quality. Recall the definition of the $F$ distribution. If $W_1 \sim \chi^2(\nu_1)$ and $W_2 \sim \chi^2(\nu_2)$ are independent, then $F = \frac{W_1/\nu_1}{W_2/\nu_2} \sim F(\nu_1,\nu_2)$. Suggest a statistic for testing $H_0: \sigma^2_1=\sigma^2_2$. Using facts from the formula sheet, show it has an $F$ distribution when $H_0$ is true. Don't forget to state the degrees of freedom. Assume that data coming from the two locations are independent.
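One way to explore this problem is by simulation. The sketch below is not the answer; it generates hypothetical data at two locations with $\sigma_1 = \sigma_2$ (so $H_0$ is true) and computes the ratio of the two values of $s^2$. Whether that ratio is your proposed statistic is for you to decide.
{\footnotesize
\begin{verbatim}
# Simulated sketch: two independent regressions with equal error variance.
set.seed(17)
n1 = 40; n2 = 60; k = 2
x11 = rnorm(n1); x12 = rnorm(n1)         # Independent variables, location one
x21 = rnorm(n2); x22 = rnorm(n2)         # Independent variables, location two
y1 = 1 + x11 + x12 + rnorm(n1, sd = 2)   # sigma_1 = 2
y2 = 1 + x21 + x22 + rnorm(n2, sd = 2)   # sigma_2 = 2, so H0 is true
s2.one = sum(resid(lm(y1 ~ x11 + x12))^2) / (n1 - k - 1)
s2.two = sum(resid(lm(y2 ~ x21 + x22))^2) / (n2 - k - 1)
s2.one / s2.two                          # Ratio of the two values of s-squared
qf(c(0.025, 0.975), n1 - k - 1, n2 - k - 1)  # For comparison
\end{verbatim}
} % End size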
\pagebreak

\item % This problem would be even better with the spectral decomposition of X'X-inverse. Say y-hat unaffected.
Assume the usual linear model with normal errors and the columns of $X$ linearly independent; see the formula sheet. We know that a one-to-one linear transformation of the independent variables affects the interpretation of the $\beta_j$ parameters, but otherwise it has no effect. Suppose that a model is to be used for prediction only, so that interpretation of the regression coefficients is not an issue.

Here is a transformation that has interesting effects; it is also convenient for some purposes. Since $X^\prime X$ is symmetric, we have the spectral decomposition $X^\prime X = CDC^\prime$, where $D$ is a diagonal matrix of eigenvalues (call them $\lambda_0, \lambda_1, \ldots, \lambda_k$), and the columns of $C$ are the corresponding eigenvectors. Suppose we transform $X$ by $X^*=XC$. This also transforms $\boldsymbol{\beta}$, and the estimate of the corresponding $\boldsymbol{\beta}^*$ is denoted by $\mathbf{b}^*$.
\begin{enumerate}
\item Could any of the eigenvalues be negative or zero? Answer Yes or No and briefly explain. This might require some review.
\item Give a formula for $\mathbf{b}^*$. Simplify.
\item What is the distribution of $\mathbf{b}^*$? Simplify.
\item What is $Var(b^*_j)$?
\item Are the $b^*_j$ random variables independent? Answer Yes or No. Why?
\item What is the variance of the linear combination $\ell_0 b^*_0 + \ell_1 b^*_1 + \cdots + \ell_k b^*_k$?
\end{enumerate}
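Here is a small R illustration of the transformation, with a simulated $X$ matrix; nothing about it is specific to any data set. It shows that $X^{*\prime} X^*$ comes out diagonal, which bears on the independence question above.
{\footnotesize
\begin{verbatim}
# Spectral decomposition of X'X and the transformed matrix X* = XC,
# using a small simulated X matrix.
set.seed(302)
n = 30
X = cbind(1, rnorm(n), rnorm(n))        # First column of ones, as usual
eig = eigen(t(X) %*% X, symmetric = TRUE)
C = eig$vectors; D = diag(eig$values)
eig$values                              # Inspect the eigenvalues
Xstar = X %*% C
round(t(Xstar) %*% Xstar - D, 10)       # X*'X* equals D, a diagonal matrix
\end{verbatim}
} % End size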
\item This question will be a lot easier if you remember that if $X \sim \chi^2(\nu)$, then $E(X)=\nu$ and $Var(X)=2\nu$. You don't have to prove these facts; just use them. For the usual linear regression model with normal errors, $\sigma^2$ is usually estimated with $s^2 = \mathbf{e}^\prime\mathbf{e}/(n-k-1)$.
\begin{enumerate}
\item Show that $s^2$ is an unbiased estimator of $\sigma^2$. You did this the hard way in an earlier assignment. It's much easier when the errors are normal.
\item What is the distribution of $\sum_{i=1}^n \left( \frac{\epsilon_i-0}{\sigma} \right)^2$?
\item Here is another estimator of $\sigma^2$. Define $v = \frac{1}{n} \sum_{i=1}^n \epsilon_i^2$. What is $E(v)$?
\item Show that $Var(v) < Var(s^2)$.
\item So it would appear that $v$ is a better estimator of $\sigma^2$ than $s^2$ is, since they are both unbiased and the variance of $v$ is lower. Why, then, do you think $s^2$ is used in regression analysis instead of $v$?
\end{enumerate}

\pagebreak

\item Regression diagnostics are mostly based on the residuals. This question compares the error terms $\epsilon_i$ to the residuals $e_i$. Answer True or False to each statement.
\begin{enumerate}
\item $E(\epsilon_i) = 0$
\item $E(e_i) = 0$
\item $Var(\epsilon_i) = 0$
\item $Var(e_i) = 0$
\item $\epsilon_i$ has a normal distribution.
\item $e_i$ has a normal distribution.
\item $\epsilon_1, \ldots, \epsilon_n$ are independent.
\item $e_1, \ldots, e_n$ are independent.
\end{enumerate}

\item One of these statements is true, and the others are false. Pick the true one, and show it is true with a quick calculation. Start with something from the formula sheet.
\begin{itemize}
\item $\widehat{\mathbf{y}} = X \mathbf{b} + \mathbf{e}$
\item $\mathbf{y} = X \mathbf{b} + \mathbf{e}$
\item $\widehat{\mathbf{y}} = X \boldsymbol{\beta} + \mathbf{e}$
\end{itemize}
As the saying goes, ``Data equals fit plus residual.'' (For a numerical check, see the R sketch below.)

\item The \emph{deleted residual} is $e_{(i)} = y_i - \mathbf{x}^\prime_i \mathbf{b}_{(i)}$, where $\mathbf{b}_{(i)}$ is defined as usual, but based on the $n-1$ observations with observation $i$ deleted.
\begin{enumerate}
\item Guided by an expression on the formula sheet, write the formula for the Studentized deleted residual $e_i^*$. You don't have to prove anything. You will need the symbols $X_{(i)}$ and $s^2_{(i)}$, which are defined in the natural way.
\item If the model is correct, what is the distribution of the Studentized deleted residual? Make sure you have the degrees of freedom right.
\item Why are the numerator and denominator independent?
\end{enumerate}

\item You know that for the general linear regression model, $\widehat{\mathbf{y}}$ and $\mathbf{e}$ are independent, meaning that they are \emph{always} independent, for every regression model with normal error terms.
\begin{enumerate}
\item Are $\mathbf{y}$ and $\widehat{\mathbf{y}}$ independent? Answer Yes or No and prove your answer. % No, Cov = sigma^2 H neq 0. Plus sample r^2 = R^2
\item Are $\mathbf{y}$ and $\mathbf{e}$ independent? Answer Yes or No and prove your answer. % No, Cov = sigma^2 (I-H) neq 0.
\end{enumerate}

\item For the general linear regression model, calculate $X^\prime \, \mathbf{e}$ one more time. This will help with the next question.
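The next sketch, again with simulated data, checks several of the preceding facts numerically: that the data equal fit plus residual, that $X^\prime \mathbf{e} = \mathbf{0}$, and that R will compute Studentized deleted residuals for you.
{\footnotesize
\begin{verbatim}
# Numerical checks with simulated data (illustrative only).
set.seed(302)
n = 25
x = rnorm(n)
y = 2 + 3*x + rnorm(n)
model = lm(y ~ x)
X = model.matrix(model)                 # The X matrix, intercept column included
e = resid(model)
max(abs(y - (fitted(model) + e)))       # Data = fit + residual, so this is zero
round(t(X) %*% e, 10)                   # X'e, up to rounding error
rstudent(model)                         # Studentized deleted residuals
\end{verbatim}
} % End size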
\item \label{gls} In the regression model $\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}$, let $\boldsymbol{\epsilon} \sim N_n(\mathbf{0},\sigma^2\Omega)$, with $\Omega$ a \emph{known} symmetric positive definite matrix.
\begin{enumerate}
\item Is $\mathbf{b}$ still an unbiased estimator of $\boldsymbol{\beta}$ for this problem?
\item What is $cov(\mathbf{b})$ for this problem?
\item Multiply $\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}$ on the left by $\Omega^{-1/2}$, obtaining $\mathbf{y}^* = X^* \boldsymbol{\beta} + \boldsymbol{\epsilon}^*$. What is the distribution of $\boldsymbol{\epsilon}^*$?
\item Substituting $X^*$ and $\mathbf{y}^*$ into the formula for $\mathbf{b}$, obtain the generalized least squares estimate $\mathbf{b}_{gls} = (X^\prime\Omega^{-1}X)^{-1} X^\prime \Omega^{-1} \mathbf{y}$ on page 133 of the textbook. If you look in the textbook, I think you will appreciate the notation we are using.
\item What is the distribution of $\mathbf{b}_{gls}$? Show your work. All you have to do is calculate the expected value and covariance matrix, but \emph{why}? What specific fact on the formula sheet are you using?
\item Just realize that there's more. You could obtain formulas for the starred versions of $H$, $\widehat{\mathbf{y}}$, $\mathbf{e}$, and the $F$ statistic for the general linear test. % Also, incorrect omega is at least unbiased.
\end{enumerate}

\item For a very simple aggregated data set, our data are a collection of sample means $\overline{y}_1, \ldots, \overline{y}_n$ based on $n$ independent random samples from a common population. Data values in the \emph{unaggregated} data set come from a distribution with common mean $\mu$ and common variance $\sigma^2$. Sample mean $i$ is based on $m_i$ observations, so that (approximately, by the Central Limit Theorem) $\overline{y}_i \sim N(\mu,\frac{\sigma^2}{m_i})$.
\begin{enumerate}
\item One could estimate $\mu$ with the arithmetic mean of the sample means. Is this estimator unbiased? What is its variance?
\item Start with the regression-like equation $\overline{y}_i = \mu + \epsilon_i$, where $\epsilon_i \sim N(0,\frac{\sigma^2}{m_i})$. Multiply both sides by $\sqrt{m_i}$, obtaining a starred version of the regression equation. What is $Var(\epsilon^*_i)$?
\item Give the generalized (weighted) least squares estimate of $\mu$. Call it $\widehat{\mu}_{gls}$.
\item If you had access to the unaggregated data (that is, all the $y_{ij}$ values), how would you estimate $\mu$? Yes, that's you personally. What is the connection of your statistic to $\widehat{\mu}_{gls}$?
\end{enumerate}
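For the generalized least squares material, here is a minimal R sketch of the aggregated-means setup, with simulated sample means (the numbers $\mu = 100$ and $\sigma = 15$ are invented). It computes $\mathbf{b}_{gls}$ directly from the formula above and compares it to \texttt{lm} with a \texttt{weights} argument.
{\footnotesize
\begin{verbatim}
# Generalized (weighted) least squares for the aggregated-means setup,
# with simulated data. Omega is known and diagonal here.
set.seed(302)
n = 10
m = sample(5:50, n, replace = TRUE)      # Sample sizes m_i
ybar = rnorm(n, mean = 100, sd = 15/sqrt(m))  # ybar_i ~ N(mu, sigma^2/m_i)
X = matrix(1, n, 1)                      # Just an intercept: E(ybar_i) = mu
Omega = diag(1/m)                        # cov(epsilon) = sigma^2 * Omega
b.gls = solve(t(X) %*% solve(Omega) %*% X) %*% t(X) %*% solve(Omega) %*% ybar
b.gls                                    # The generalized least squares estimate
coef(lm(ybar ~ 1, weights = m))          # lm with weights m_i agrees
\end{verbatim}
} % End size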
\pagebreak
% 305f15 Final exam has a nice version of this question with proper spacing for the answers.

\item \label{pigweight} Pigs are routinely given large doses of antibiotics even when they show no signs of illness, to protect their health under unsanitary conditions. Pigs were randomly assigned to one of three antibiotic drugs. Dressed weight (weight of the pig after slaughter and removal of head, intestines and skin) was the dependent variable. Independent variables are Drug type, Mother's live adult weight and Father's live adult weight. Data are in the file \href{http://www.utstat.toronto.edu/~brunner/data/legal/pigweight.data.txt}{\texttt{pigweight.data.txt}}. You can get a copy with
{\footnotesize
\begin{verbatim}
oink = read.table("http://www.utstat.toronto.edu/~brunner/data/legal/pigweight.data.txt")
\end{verbatim}
} % End size
\begin{enumerate}
\item Write the regression equation for the full model, including $\epsilon_i$.
\item Make a table with one row for every drug, with columns showing how R would define the dummy variables by default. Make another column giving $E(y|\mathbf{x})$ for each drug.
\item Fit the model with R, and predict the dressed weight of a pig getting Drug 2, whose mother weighed 140 pounds and whose father weighed 185 pounds. Use the \texttt{predict} function to obtain the prediction and a 95\% prediction interval. Your answer is a set of three numbers: a prediction, a lower prediction limit and an upper prediction limit.
\item This parallel planes regression model specifies that the differences in expected weight for the different drug treatments are the same for every possible combination of mother's weight and father's weight. Give a 95\% confidence interval for the difference in expected weight between drug treatments 2 and 3. The final answer is a pair of numbers: a lower confidence limit and an upper confidence limit. There is an easy way and a less easy way.
\item In symbols, give the null hypotheses you would test to answer the following questions. Your answers are statements involving the $\beta$ values from your regression equation.
\begin{enumerate} % There were more questions in an earlier draft.
\item Controlling for mother's weight and father's weight, does type of drug have an effect on the expected weight of a pig?
\item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 1 or Drug 2?
\item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 1 or Drug 3?
\item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 2 or Drug 3?
\end{enumerate}

\pagebreak

\item For each of the questions below, give the value of the $t$ or $F$ statistic (a number from your printout), and indicate whether or not you reject the null hypothesis. The numbers may or may not be part of the default output from \texttt{summary}.
\begin{enumerate}
\item Controlling for mother's weight and father's weight, does type of drug have an effect on the expected weight of a pig?
\item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 1 or Drug 2?
\item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 1 or Drug 3?
\item Controlling for mother's weight and father's weight, which drug helps the average pig gain more weight, Drug 2 or Drug 3?
\item Allowing for which drug they were given, does the expected weight of a pig increase faster as a function of the mother's weight, or faster as a function of the father's weight?
\end{enumerate}
\item An accepted rule of thumb about influential observations is that if the maximum diagonal value of the $H$ matrix is bigger than 0.2, one or more observations might be having too much influence on the results. Is there evidence of this kind of trouble? Your printout should have the one number you need to answer the question.
\item To check the residuals for possible outliers, treat the Studentized deleted residuals as $t$ statistics with a Bonferroni correction. Is there evidence of outliers? Of course the evidence for your conclusion (including the critical value) should be on your printout.
\item We can assume that farmers want their pigs to weigh a lot. In plain, non-statistical language, can you offer some advice to a farmer based on these data? Remember, the farmer must be able to understand your answer, or it is worthless.
\end{enumerate} % End computer question

\textbf{Please bring your printout to the quiz.}

\end{enumerate}

% \vspace{30mm}
\noindent
\begin{center}\begin{tabular}{l}
\hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/302f17}{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f17}}

\end{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Later

\item \label{centered} This question explores the practice of centering quantitative independent variables in a regression by subtracting off the mean. Geometrically, this should not alter the configuration of data points in the multi-dimensional scatterplot. All it does is shift the axes. Thus, the intercept of the least squares plane should change, but the slopes should not.
\begin{enumerate}
\item Consider a simple experimental study with an experimental group, a control group and a single quantitative covariate. Independently for $i=1, \ldots, n$, let
\begin{displaymath}
y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \epsilon_i,
\end{displaymath}
where $x_i$ is the covariate and $d_i$ is an indicator dummy variable for the experimental group. If the covariate is ``centered,'' the model can be written
\begin{displaymath}
y_i = \beta_0^* + \beta_1^* (x_i-\overline{x}) + \beta_2^* d_i + \epsilon_i,
\end{displaymath}
where $\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i$.
\begin{enumerate}
\item Express the $\beta^*$ quantities in terms of the original $\beta$ quantities.
\item Let's generalize this. For the general linear model in matrix form, suppose $\boldsymbol{\beta}^* = \mathbf{A}\boldsymbol{\beta}$, where $\mathbf{A}$ is a square matrix with an inverse. This makes $\boldsymbol{\beta}^*$ a one-to-one function of $\boldsymbol{\beta}$. Of course $X$ is affected as well. Show that $\mathbf{b}^* = \mathbf{A}\mathbf{b}$.
\item Give the matrix $\mathbf{A}$ for this $p=3$ model.
\item If the data are centered, what is $E(y|x)$ for the experimental group, and what is $E(y|x)$ for the control group?
\end{enumerate}
\item In the following model, there are $k$ quantitative independent variables. The un-centered version is
\begin{displaymath}
y_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{k} x_{i,k} + \epsilon_i,
\end{displaymath}
and the centered version is
\begin{displaymath}
y_i = \beta_0^* + \beta_1^* (x_{i,1}-\overline{x}_1) + \ldots + \beta_{k}^* (x_{i,k}-\overline{x}_{k}) + \epsilon_i,
\end{displaymath}
where $\overline{x}_j = \frac{1}{n}\sum_{i=1}^n x_{i,j}$ for $j = 1, \ldots, k$.
\begin{enumerate}
\item What is $\beta_0^*$ in terms of the $\beta$ quantities?
\item What is $\beta_j^*$ in terms of the $\beta$ quantities?
\item What is $\widehat{\beta}_0$ in terms of the $\widehat{\beta}^*$ quantities?
\item Using $\sum_{i=1}^n\widehat{y}_i = \sum_{i=1}^ny_i$, show that $\widehat{\beta}_0^* = \overline{y}$.
\end{enumerate}
% \newpage
\item Now consider again the study with an experimental group, a control group and a single covariate. This time the interaction is included.
\begin{displaymath}
y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \beta_3 x_id_i + \epsilon_i
\end{displaymath}
The centered version is
\begin{displaymath}
y_i = \beta_0^* + \beta_1^* (x_i-\overline{x}) + \beta_2^* d_i + \beta_3^* (x_i-\overline{x})d_i + \epsilon_i
\end{displaymath}
\begin{enumerate}
\item Express the $\beta^*$ quantities from the centered model in terms of the $\beta$ quantities from the un-centered model. Is the correspondence one to one?
\item \label{difatmean} For the un-centered model, what is the difference between $E(y|X=\overline{x})$ for the experimental group and $E(y|X=\overline{x})$ for the control group?
\item What is the difference between the intercepts for the centered model? Compare this to your answer to Question~\ref{difatmean}.
\end{enumerate}
\end{enumerate}
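A quick R illustration (simulated data, made-up variable names) of the claim at the start of this question: centering the covariate shifts the intercept but leaves the slopes, and the fitted values, alone.
{\footnotesize
\begin{verbatim}
# Centering the covariate: same slopes, new intercept, identical fit.
set.seed(302)
n = 40
x = rnorm(n, mean = 10); d = rbinom(n, 1, 0.5)
y = 1 + 2*x + 3*d + rnorm(n)
uncentered = lm(y ~ x + d)
centered = lm(y ~ I(x - mean(x)) + d)
coef(uncentered); coef(centered)        # Slopes agree; intercepts differ
max(abs(fitted(uncentered) - fitted(centered)))  # Fitted values identical
\end{verbatim}
} % End size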