\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f16 Assignment Ten}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent Except for Problem~\ref{trees}, these problems are preparation for the quiz in tutorial on Thursday November 24th, and are not to be handed in. Please bring your printout for Problem~\ref{trees} to the quiz. Do not write anything on the printout in advance of the quiz, except possibly your name and student number.

\begin{enumerate}

% Did I already ask this?
\item Suppose you fit (estimate the parameters of) a regression model, obtaining $\mathbf{b}$, $\widehat{\mathbf{y}}$ and $\mathbf{e}$. Call this Model One.
\begin{enumerate}
\item In an attempt to squeeze more information out of the data, you fit a second regression model, using $\mathbf{e}$ from Model One as the dependent variable, and exactly the same $X$ matrix as Model One. Call this Model Two.
\begin{enumerate}
\item What is $\mathbf{b}$ for Model Two? Show your work and simplify.
\item What is $\widehat{\mathbf{y}}$ for Model Two? Show your work and simplify.
\item What is $\mathbf{e}$ for Model Two? Show your work and simplify.
\item What is $s^2$ for Model Two?
\end{enumerate}
\item Now you fit a \emph{third} regression model, this time using $\widehat{\mathbf{y}}$ from Model One as the dependent variable and, again, exactly the same $X$ matrix as Model One. Call this Model Three.
\begin{enumerate}
\item What is $\mathbf{b}$ for Model Three? Show your work and simplify.
\item What is $\widehat{\mathbf{y}}$ for Model Three? Show your work and simplify.
\item What is $\mathbf{e}$ for Model Three? Show your work and simplify.
\item What is $s^2$ for Model Three?
\end{enumerate}
\end{enumerate}

\item Data for a regression study are collected at two different locations; $n_1$ observations are collected at location one, and $n_2$ observations are collected at location two. The same independent variables are used at each location. We need to know whether the error variance $\sigma^2$ is the same at the two locations, possibly because we are concerned about data quality. Recall the definition of the $F$ distribution: if $W_1 \sim \chi^2(\nu_1)$ and $W_2 \sim \chi^2(\nu_2)$ are independent, then $F = \frac{W_1/\nu_1}{W_2/\nu_2} \sim F(\nu_1,\nu_2)$. Suggest a statistic for testing $H_0: \sigma^2_1=\sigma^2_2$. Using facts from the formula sheet, show that it has an $F$ distribution when $H_0$ is true. Don't forget to state the degrees of freedom. Assume that data coming from the two locations are independent.

\newpage
\item Assume the usual linear model with normal errors; see the formula sheet. We know that a one-to-one linear transformation of the independent variables affects the interpretation of the $\beta_j$ parameters, but otherwise it has no effect. Suppose that a model is to be used for prediction only, so that interpretation of the regression coefficients is not an issue. Here is a transformation that has interesting effects; it is also convenient for some purposes. Since $X^\prime X$ is symmetric, we have the spectral decomposition $X^\prime X = CDC^\prime$, where $D$ is a diagonal matrix of eigenvalues, and the columns of $C$ are the corresponding eigenvectors. Suppose we transform $X$ by $X^*=XC$. Because the columns of $C$ are orthonormal, $CC^\prime = I$, so $X\boldsymbol{\beta} = XCC^\prime\boldsymbol{\beta} = X^*\boldsymbol{\beta}^*$, where $\boldsymbol{\beta}^* = C^\prime\boldsymbol{\beta}$. The corresponding estimate of $\boldsymbol{\beta}^*$ is denoted by $\mathbf{b}^*$.
\begin{enumerate}
\item Could any of the eigenvalues be negative or zero?
Answer Yes or No and briefly explain. This might require some review.
\item Give a formula for $\mathbf{b}^*$. Simplify.
\item What is the distribution of $\mathbf{b}^*$? Simplify.
\item What is $Var(b^*_j)$?
\item Are the $b^*_j$ random variables independent? Answer Yes or No. Why?
\item What is the variance of the linear combination $\ell_0 b^*_0 + \ell_1 b^*_1 + \cdots + \ell_k b^*_k$?
\end{enumerate}

\item A forestry company has developed a regression equation for predicting the amount of useable wood that they will get from a tree, based on a set of measurements that can be taken without cutting the tree down. They are convinced that a model with normal error terms is right. They have $\mathbf{b}$ and $s^2$ based on a set of $n$ trees they measured first and then cut down, and they know how to calculate a predicted $y$ and a prediction interval for the amount of wood they will get from a single tree. But that's not what they want. They have a set of $r$ more trees they are planning to cut down, and they have measured the independent variables for each tree, yielding $\mathbf{x}_{n+1}, \ldots, \mathbf{x}_{n+r}$. What they want is a prediction of the \emph{total} amount of wood they will get from these trees, along with a $95\%$ prediction interval for the total.
\begin{enumerate}
\item The quantity they want to predict is $w = \sum_{j=n+1}^{n+r}y_j$, where $y_j = \mathbf{x}_j^\prime\boldsymbol{\beta} + \epsilon_j$. What is the distribution of $w$? You can just write down the answer without showing any work.
\item Let $\widehat{w}$ denote the prediction of $w$. It is calculated using the company's regression data along with $\mathbf{x}_{n+1}, \ldots, \mathbf{x}_{n+r}$. Give a formula for $\widehat{w}$. Simplify.
\item What is the distribution of $w-\widehat{w}$? Show your work, but don't use moment-generating functions. Just write down the expected value and calculate the variance.
\item Now standardize $w-\widehat{w}$ to obtain a standard normal. Call it $z$.
\item Divide $z$ by the square root of a chi-squared random variable divided by its degrees of freedom, and simplify. Call it $t$. What are the degrees of freedom?
\item How do you know that numerator and denominator are independent?
\item Using your formula for $t$, derive the $(1-\alpha)\times 100\%$ prediction interval for $w$. Please use the symbol $t_{\alpha/2}$ for the critical value.
\end{enumerate}

\item \label{trees} This question uses the \texttt{trees} data you saw in the R lecture (``Least squares with R''). Start by fitting a model with just \texttt{Girth} and \texttt{Height}. The forestry company wants to predict the volume of wood they would obtain if they cut down three particular trees. The first tree has a girth of 11.0 and a height of 75. The second tree has a girth of 14.8 and a height of 80. The third tree has a girth of 10.5 and a height of 65. Using R,
\begin{enumerate}
\item Calculate a predicted amount of wood the company will obtain by cutting down these trees. The answer is a number.
\item Calculate a 95\% prediction interval for the total amount of wood. The answer is a pair of numbers, a lower prediction limit and an upper prediction limit.
\end{enumerate}

\item Regression diagnostics are mostly based on the residuals. This question compares the error terms $\epsilon_i$ to the residuals $e_i$. Answer True or False to each statement. For statements about the residuals, show a calculation that proves your answer. You may use anything on the formula sheet.
\begin{enumerate}
\item $E(\epsilon_i) = 0$
\item $E(e_i) = 0$
\item $Var(\epsilon_i) = 0$
\item $Var(e_i) = 0$
\item $\epsilon_i$ has a normal distribution.
\item $e_i$ has a normal distribution.
\item $\epsilon_1, \ldots, \epsilon_n$ are independent.
\item $e_1, \ldots, e_n$ are independent.
\end{enumerate}

\item Exactly one of the following statements is true, and the others are false. Pick the true one, and show it is true with a quick calculation. Start with something from the formula sheet.
\begin{itemize}
\item $\widehat{\mathbf{y}} = X \mathbf{b} + \mathbf{e}$
\item $\mathbf{y} = X \mathbf{b} + \mathbf{e}$
\item $\widehat{\mathbf{y}} = X \boldsymbol{\beta} + \mathbf{e}$
\end{itemize}
As the saying goes, ``Data equals fit plus residual.''

\pagebreak
\item The \emph{deleted residual} is $e_{(i)} = y_i - \mathbf{x}^\prime_i \mathbf{b}_{(i)}$, where $\mathbf{b}_{(i)}$ is defined as usual, but based on the $n-1$ observations with observation $i$ deleted.
\begin{enumerate}
\item Guided by an expression on the formula sheet, write the formula for the Studentized deleted residual. You don't have to prove anything. You will need the symbols $X_{(i)}$ and $s^2_{(i)}$, which are defined in the natural way.
\item If the model is correct, what is the distribution of the Studentized deleted residual? Make sure you have the degrees of freedom right.
\item Why are numerator and denominator independent?
\end{enumerate}

\item For the general linear regression model, are $\widehat{\mathbf{y}}$ and $\mathbf{e}$ independent?
\begin{enumerate}
\item Answer Yes or No and prove your answer.
\item What does this imply about the plot of predicted values against residuals?
\end{enumerate}

\item For the general linear regression model, are $\mathbf{y}$ and $\widehat{\mathbf{y}}$ independent? Answer Yes or No and prove your answer.
% No, C = sigma^2 H neq 0. Plus sample r^2 = R^2

\item For the general linear regression model, are $\mathbf{y}$ and $\mathbf{e}$ independent? Answer Yes or No and prove your answer.
% No, C = sigma^2 (I-H) neq 0.

\item For the general linear regression model, calculate $X^\prime \, \mathbf{e}$ one more time. This will help with the next question.

\item For the general linear regression model in which $X$ is a matrix of constants,
\begin{enumerate}
\item Why does it not make sense to ask about independence of the independent variable values and the residuals?
\item Prove that the sample correlation between residuals and independent variable values must equal exactly zero.
\item Does this result depend on the correctness of the model?
\item What does the sample correlation between residuals and independent variable values imply about the corresponding plots?
\end{enumerate}
\end{enumerate}
% End of assignment

\noindent Please bring your printout for Problem~\ref{trees} to the quiz. \textbf{Your printout should show \emph{all} R input and output, and \emph{only} R input and output}. Do not write anything on your printouts except your name and student number.

\vspace{5mm}

\noindent
\begin{center}\begin{tabular}{l}
\hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f16}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f16}}

\end{document}