\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f14 Assignment Eight}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
This assignment assumes you are using the \href{http://www.utstat.toronto.edu/~brunner/302f13/formulas/302f14Formulas7.pdf}
{Formula sheet}. There is a link on the course home page in case the one in this document does not work. The formula sheet (or part of it) will be provided with the quiz. \textbf{Bring your printouts for Question~\ref{trees} to the quiz, including the plots}.

\begin{enumerate}

\item This question compares the error terms $\epsilon_i$ to the residuals $\widehat{\epsilon}_i$. Answer True or False to each statement. For statements about the residuals, show a calculation that proves your answer. You may use anything on the formula sheet.
\begin{enumerate}
\item $E(\epsilon_i) = 0$
\item $E(\widehat{\epsilon}_i) = 0$
\item $Var(\epsilon_i) = 0$
\item $Var(\widehat{\epsilon}_i) = 0$
\item $\epsilon_i$ has a normal distribution.
\item $\widehat{\epsilon}_i$ has a normal distribution.
\item $\epsilon_1, \ldots, \epsilon_n$ are independent.
\item $\widehat{\epsilon}_1, \ldots, \widehat{\epsilon}_n$ are independent.
\end{enumerate}

\item Exactly one of these statements is true, and the other two are false. Pick the true one, and show it is true with a quick calculation. Start with something from the formula sheet.
\begin{itemize}
\item $\widehat{\mathbf{Y}} = \mathbf{X} \widehat{\boldsymbol{\beta}} + \widehat{\boldsymbol{\epsilon}}$
\item $\mathbf{Y} = \mathbf{X} \widehat{\boldsymbol{\beta}} + \widehat{\boldsymbol{\epsilon}}$
\item $\widehat{\mathbf{Y}} = \mathbf{X} \boldsymbol{\beta} + \widehat{\boldsymbol{\epsilon}}$
\end{itemize}
As the saying goes, ``Data equals fit plus residual.''

\item The \emph{deleted residual} is $\widehat{\epsilon}_{(i)} = Y_i - \mathbf{x}^\prime_i \widehat{\boldsymbol{\beta}}_{(i)}$, where $\widehat{\boldsymbol{\beta}}_{(i)}$ is defined as usual, but based on the $n-1$ observations with observation $i$ deleted.
\begin{enumerate}
\item Guided by an expression on the formula sheet, write the formula for the Studentized deleted residual. You don't have to prove anything. You will need the symbols $\mathbf{X}_{(i)}$ and $MSE_{(i)}$, which are defined in the natural way.
\item If the model is correct, what is the distribution of the Studentized deleted residual? Make sure you have the degrees of freedom right.
\item Why are the numerator and denominator independent?
\end{enumerate}

\item For the general linear regression model, are $\mathbf{Y}$ and $\widehat{\mathbf{Y}}$ independent? Answer Yes or No and prove your answer.

\item For the general linear regression model, show that the squared sample correlation between $\mathbf{Y}$ and $\widehat{\mathbf{Y}}$ equals $R^2$. What does this imply about the plot of observed versus predicted values of the dependent variable?

\item For the general linear regression model, are $\widehat{\mathbf{Y}}$ and $\widehat{\boldsymbol{\epsilon}}$ independent?
\begin{enumerate}
\item Answer Yes or No and prove your answer.
\item What does this imply about the plot of predicted values against residuals?
\end{enumerate}

\item For the general linear regression model, are $\mathbf{Y}$ and $\widehat{\mathbf{Y}}$ independent? Answer Yes or No and prove your answer.
% No, C = sigma^2 H neq 0.
% Plus sample r^2 = R^2

\item For the general linear regression model, are $\mathbf{Y}$ and $\widehat{\boldsymbol{\epsilon}}$ independent? Answer Yes or No and prove your answer.
% No, C = sigma^2 (I-H) neq 0.

\item For the general linear regression model, calculate $\mathbf{X}^\prime \, \widehat{\boldsymbol{\epsilon}}$. This will help with the next question.

\item For the general linear regression model,
\begin{enumerate}
\item Why does it not make sense to ask about independence of the independent variable values and the residuals?
\item Prove that the sample correlation between residuals and independent variable values must equal exactly zero.
\item Does this result depend on the correctness of the model?
\item What does the correlation between residuals and independent variable values imply about the corresponding plots?
\end{enumerate}

\item In last week's analysis of the \texttt{Census Tract} data, you did a simultaneous test of \texttt{old}, \texttt{labor} and \texttt{income} controlling for the other variables. Here's a bit of my output.
\begin{verbatim}
> # Test old, labor and income
> redmodel = lm(crimerate ~ area+urban+docs+beds+hs)
> anova(redmodel,fullmodel)
Analysis of Variance Table

Model 1: crimerate ~ area + urban + docs + beds + hs
Model 2: crimerate ~ area + urban + old + docs + beds + hs + labor + income
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1    135 19817
2    132 19792  3    25.683 0.0571  0.982
\end{verbatim}
After controlling for other variables in the model, what proportion of the remaining variation is explained by \texttt{old}, \texttt{labor} and \texttt{income}? The answer is a number between zero and one that you can get with a calculator. Show some work.
% 3*0.0571/(132+3*0.0571) = 0.001296

\newpage

\item \label{trees} Lecture slide set 7 used the \texttt{trees} data. Typing \texttt{help(trees)} at the R prompt gives more information. For this question, bring your R printouts to the quiz, \emph{including the plots}.
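
Several parts of this question ask for a proportion of remaining variation. As a reminder, here is a sketch of where that number comes from. The notation is an assumption and should be checked against the formula sheet: $SSE_R$ and $SSE_F$ are the error sums of squares of the reduced and full models, $s$ is the number of variables being tested, and $n-p$ is the error degrees of freedom of the full model. Then
\begin{displaymath}
F = \frac{(SSE_R-SSE_F)/s}{SSE_F/(n-p)}
\hspace{5mm} \mbox{and} \hspace{5mm}
a = \frac{SSE_R-SSE_F}{SSE_R} = \frac{sF}{n-p+sF}.
\end{displaymath}
When a single variable is tested, $s=1$ and $F=t^2$, so $a$ can be computed from the $t$ statistic in the default output.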
\begin{enumerate}
\item Fit an ordinary model with two independent variables. How much of the variability in Volume is explained? You have to admit, that's pretty good.
% summary(lm(Volume ~ Girth+Height,data=trees))
\item Once you control for \texttt{Girth}, what proportion of the remaining variation in \texttt{Volume} is explained by \texttt{Height}? The answer is a number between zero and one that can be obtained from the default output (that is, the output of \texttt{summary}) using a calculator.
% 1*2.607^2/(28+1*2.607^2) = 0.1953202
\item Once you control for \texttt{Height}, what proportion of the remaining variation in \texttt{Volume} is explained by \texttt{Girth}? The answer is a number between zero and one that can be obtained from the default output (that is, the output of \texttt{summary}) using a calculator.
% 1*17.816^2/(28+1*17.816^2) = 0.9189369
\item Now let's look at the deleted Studentized residuals. One student made an excellent suggestion, which was to look at boxplots. Try \texttt{boxplot(varname)}, where \texttt{varname} is the name of the deleted Studentized residual. If you don't know what a boxplot is, look it up on Wikipedia. This part is interesting, but it will not be on the quiz. Do you see one possible high outlier?
\item Now treat the deleted Studentized residuals as $t$-test statistics, with a Bonferroni correction to achieve a \emph{joint} significance level of $0.05$. What is the critical value? It's a number you get from R and display on your printout. This \emph{could} be on the quiz.
% 3.504931
\item Is there evidence of outliers? Answer Yes or No.
% No.
\item Now plot predicted values against standardized residuals. Put a title on the plot. See \texttt{help(title)}. Do you see anything fishy, or perhaps wavy?
\item Now plot the independent variables in the model against the standardized residuals. It's a bit subjective, but when I do this I see a curvilinear trend for one independent variable, but not for the other. Which one?
Then I thought about it for a while. Finally, combining a bit of geometry with what little I know about trees, I came up with a model. This model has \emph{one} independent variable, a function of Height and Girth, and it explains almost 98\% of the variation in volume. The residual plots look pretty clean. Can you guess my model?
\end{enumerate}

\end{enumerate}

\vspace{20mm}

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f14}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f14}}

\end{document}

\vspace{5mm}
\noindent
These problems are preparation for the quiz in tutorial on Friday November 1st, and are not to be handed in.

# R work
rm(list=ls())
mod1 = lm(Volume ~ Girth + Height, data=trees); summary(mod1)
1*2.607^2/(28+1*2.607^2); 1*17.816^2/(28+1*17.816^2)
rstud = rstudent(mod1)
boxplot(rstud)
# Studentized deleted residuals as t-tests with Bonferroni correction
attach(trees)
n = length(Volume); n
dfe = mod1$df.residual; dfe #$
alpha = 0.05; a = alpha/n; bcrit = qt(1-a/2,dfe-1); bcrit
sort(rstud)
# Predicted values vs. standardized residuals
r = rstandard(mod1); yhat = mod1$fitted.values #$
plot(yhat,r)
title('Predicted values vs. standardized residuals: Model 1')
# Variables in the model versus standardized residuals
plot(Girth,r,ylab = 'Standardized Residual')
plot(Height,r,ylab = 'Standardized Residual')

cvol = Height*Girth^2 # Proportional to volume of a cylinder
mod2 = lm(Volume ~ cvol); summary(mod2) # 98% explained
# Check standardized residuals - clean
r2 = rstandard(mod2); sort(r2)
plot(Girth,r2,ylab = 'Standardized Residual'); title('Model 2')
plot(Height,r2,ylab = 'Standardized Residual'); title('Model 2')