\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f13 Assignment Nine}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent This assignment assumes you are using the
\href{http://www.utstat.toronto.edu/~brunner/302f13/formulas/302f13Formulas2.pdf}
{Formula sheet}. There is a link on the course home page in case the one in this document does not work. The formula sheet (or part of it) will be provided with the quiz. \textbf{Bring your printouts for Question~\ref{trees} to the quiz, including the plots}.

\begin{enumerate}

\item For the general linear regression model, show that the square of the sample correlation between $Y$ and $\widehat{Y}$ values is equal to $R^2$.

\item This question compares the error terms $\epsilon_i$ to the residuals $\widehat{\epsilon}_i$. Answer True or False to each statement. For statements about the residuals, show a calculation that proves your answer. You may use anything on the formula sheet.
    \begin{enumerate}
    \item $E(\epsilon_i) = 0$
    \item $E(\widehat{\epsilon}_i) = 0$
    \item $Var(\epsilon_i) = 0$
    \item $Var(\widehat{\epsilon}_i) = 0$
    \item $\epsilon_i$ has a normal distribution.
    \item $\widehat{\epsilon}_i$ has a normal distribution.
    \item $\epsilon_1, \ldots, \epsilon_n$ are independent.
    \item $\widehat{\epsilon}_1, \ldots, \widehat{\epsilon}_n$ are independent.
    \end{enumerate}

\item One of these statements is true, and the other is false. Pick one, and show it is true with a quick calculation. Start with something from the formula sheet.
    \begin{itemize}
    \item $\widehat{\mathbf{Y}} = \mathbf{X} \widehat{\boldsymbol{\beta}} + \widehat{\boldsymbol{\epsilon}}$
    \item $\mathbf{Y} = \mathbf{X} \widehat{\boldsymbol{\beta}} + \widehat{\boldsymbol{\epsilon}}$
    \end{itemize}
As the saying goes, ``Data equals fit plus residual.''

\item The \emph{deleted residual} is $\widehat{\epsilon}_{(i)} = Y_i - \mathbf{x}^\prime_i \widehat{\boldsymbol{\beta}}_{(i)}$, where $\widehat{\boldsymbol{\beta}}_{(i)}$ is defined as usual, but based on the $n-1$ observations with observation $i$ deleted.
    \begin{enumerate}
    \item Guided by an expression on the formula sheet, write the formula for the Studentized deleted residual. You don't have to prove anything. You will need the symbols $\mathbf{X}_{(i)}$ and $s_{(i)}$, which are defined in the obvious way.
    \item If the model is correct, what is the distribution of the Studentized deleted residual? Make sure you have the degrees of freedom right.
    \item Why are the numerator and denominator independent?
    \end{enumerate}

\item For the general linear regression model, are $\widehat{\mathbf{Y}}$ and $\widehat{\boldsymbol{\epsilon}}$ independent?
    \begin{enumerate}
    \item Answer Yes or No and prove your answer.
    \item What does this imply about the plot of predicted values against residuals?
    \end{enumerate}

\item For the general linear regression model, are $\mathbf{Y}$ and $\widehat{\mathbf{Y}}$ independent? Answer Yes or No and prove your answer.
% No, C = sigma^2 H neq 0. Plus sample r^2 = R^2

\item For the general linear regression model, are $\mathbf{Y}$ and $\widehat{\boldsymbol{\epsilon}}$ independent? Answer Yes or No and prove your answer.
% No, C = sigma^2 (I-H) neq 0.

\item Prove that the sample correlation between residuals and independent variable values must equal exactly zero. Does this result depend on the correctness of the model?

\item \label{trees} Lecture slide set 6 used the \texttt{trees} data. Typing \texttt{help(trees)} at the R prompt gives more information.
For this question, bring your R printouts to the quiz, \emph{including the plots}.
    \begin{enumerate}
    \item Fit an ordinary model with two independent variables. How much of the variability in Volume is explained? You have to admit, that's pretty good.
    \item Now let's look at the deleted Studentized residuals. One student made an excellent suggestion, which was to look at boxplots. Try \texttt{boxplot(varname)}, where \texttt{varname} is the name of the vector of deleted Studentized residuals. If you don't know what a boxplot is, look it up on Wikipedia. This part is interesting, but it will not be on the quiz. Do you see one possible high outlier?
    \item Now treat the deleted Studentized residuals as $t$-test statistics, with a Bonferroni correction to achieve a \emph{joint} significance level of $0.05$. What is the critical value? It's a number on your R printout. This \emph{could} be on the quiz.
    % 3.504931
    \item Is there evidence of outliers? Answer Yes or No.
    % No.
    \item Now plot predicted values against standardized residuals. Put a title on the plot. See \texttt{help(title)}. Do you see anything fishy, or perhaps wavy?
    \item Now plot the independent variables in the model against the standardized residuals. It's a bit subjective, but when I do this I see a curvilinear trend for one independent variable, but not for the other. Which one?

    Then I thought about it for a while. Finally, combining a bit of geometry with what little I know about trees, I came up with a model. This model has \emph{one} independent variable, a function of Height and Girth, and it explains almost 98\% of the variation in volume. The residual plots look pretty clean. Can you guess my model?
    \end{enumerate}

\end{enumerate}

\vspace{20mm}

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto.
It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f13}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f13}}

\end{document}

\vspace{5mm}
\noindent These problems are preparation for the quiz in tutorial on Friday November 1st, and are not to be handed in.

# R work
rm(list=ls())
mod1 = lm(Volume ~ Girth + Height, data=trees); summary(mod1)
rstud = rstudent(mod1)
boxplot(rstud)
# Studentized deleted residuals as t-tests with Bonferroni correction
attach(trees)
n = length(Volume); n
dfe = mod1$df.residual; dfe #$
alpha = 0.05; a = alpha/n; bcrit = qt(1-a/2,dfe-1); bcrit
sort(rstud)
# Predicted values vs. standardized residuals
r = rstandard(mod1); yhat = mod1$fitted.values #$
plot(yhat,r)
title('Predicted values vs. standardized residuals: Model 1')
# Variables in the model versus standardized residuals
plot(Girth,r,ylab = 'Standardized Residual')
plot(Height,r,ylab = 'Standardized Residual')

cvol = Height*Girth^2 # Proportional to volume of a cylinder
mod2 = lm(Volume ~ cvol); summary(mod2) # 98% explained
# Check standardized residuals - clean
r2 = rstandard(mod2); sort(r2)
plot(Girth,r2,ylab = 'Standardized Residual'); title('Model 2')
plot(Height,r2,ylab = 'Standardized Residual'); title('Model 2')