\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f17 Assignment Eight}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
Except for Problems~\ref{trees} and \ref{curves}, these problems are preparation for the quiz in tutorial on Thursday November 16th, and are not to be handed in. Please bring your printouts for Problems~\ref{trees} and \ref{curves} to the quiz. Do not write anything on the printouts in advance of the quiz, except possibly your name and student number.

\begin{enumerate}

\item Based on the general linear model with normal error terms,
\begin{enumerate}
\item Prove the $t$ distribution given on the formula sheet for a new observation $y_0$. Use earlier material on the formula sheet. For example, how do you know the numerator and denominator are independent?
\item Derive the $(1-\alpha)\times 100\%$ prediction interval for a new observation from this population, in which the independent variable values are given in $\mathbf{x}_0$. ``Derive'' means show the High School algebra.
\end{enumerate}

\item A forestry company has developed a regression equation for predicting the amount of usable wood that they will get from a tree, based on a set of measurements that can be taken without cutting the tree down. They are convinced that a model with normal error terms is right. They have $\mathbf{b}$ and $s^2$ based on a set of $n$ trees they measured first and then cut down, and they know how to calculate a predicted $y$ and a prediction interval for the amount of wood they will get from a single tree. But that's not what they want. They have a set of $m$ more trees they are planning to cut down, and they have measured the independent variables for each tree, yielding $\mathbf{x}_{n+1}, \ldots, \mathbf{x}_{n+m}$. What they want is a prediction of the \emph{total} amount of wood they will get from these trees, along with a $95\%$ prediction interval for the total.
\begin{enumerate}
\item The quantity they want to predict is $w = \sum_{j=n+1}^{n+m}y_j$, where $y_j = \mathbf{x}_j^\prime\boldsymbol{\beta} + \epsilon_j$. What is the distribution of $w$? You can just write down the answer without showing any work.
\item Let $\widehat{w}$ denote the prediction of $w$. It is calculated using the company's regression data along with $\mathbf{x}_{n+1}, \ldots, \mathbf{x}_{n+m}$. Give a formula for $\widehat{w}$.
\item What is the distribution of $\widehat{w}$?
\item What is the distribution of $w-\widehat{w}$?
\item Now standardize $w-\widehat{w}$ to obtain a standard normal. Call it $z$.
\item Divide $z$ by the square root of a chi-squared random variable, divided by its degrees of freedom, and simplify. Call it $t$. What are the degrees of freedom?
\item How do you know that the numerator and denominator are independent?
\item Using your formula for $t$, derive the $(1-\alpha)\times 100\%$ prediction interval for $w$. Please use the symbol $t_{\alpha/2}$ for the critical value.
\end{enumerate}

\pagebreak
\item \label{trees} This question uses the \texttt{trees} data you saw in the R lecture (``Least squares with R''). Start by fitting a model with just \texttt{Girth} and \texttt{Height}. The forestry company wants to predict the volume of wood they would obtain if they cut down three particular trees. The first tree has a girth of 11.0 and a height of 75. The second tree has a girth of 14.8 and a height of 80. The third tree has a girth of 10.5 and a height of 65. Using R,
\begin{enumerate}
\item Calculate a predicted amount of wood the company will obtain by cutting down these trees. The answer is a number.
\item Calculate a 95\% prediction interval for the total amount of wood from the three trees. The answer is a pair of numbers, a lower prediction limit and an upper prediction limit.
\end{enumerate}
A rough sketch of one way to set up the computation appears below. \textbf{Please bring your printout to the quiz.}
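
The following is only a sketch of one way to organize the calculation, using the standard error of $w-\widehat{w}$ from Problem 2. The object names (\texttt{mod}, \texttt{newx}, and so on) are arbitrary choices, and other approaches (for example, building on \texttt{predict}) are equally acceptable.
\begin{verbatim}
# Sketch for Problem 3: predicted total and 95% prediction interval for it
mod   <- lm(Volume ~ Girth + Height, data = trees)
newx  <- data.frame(Girth = c(11.0, 14.8, 10.5), Height = c(75, 80, 65))
m     <- nrow(newx)
X0    <- model.matrix(~ Girth + Height, data = newx)  # rows are x_j'
what  <- sum(X0 %*% coef(mod))                        # predicted total, w-hat
s2    <- summary(mod)$sigma^2                         # MSE
xtxi  <- summary(mod)$cov.unscaled                    # (X'X)^{-1}
xsum  <- colSums(X0)                                  # sum of the x_j vectors
se    <- sqrt(s2 * (m + as.numeric(t(xsum) %*% xtxi %*% xsum)))
tcrit <- qt(0.975, df = mod$df.residual)
c(lower = what - tcrit * se, upper = what + tcrit * se)
\end{verbatim}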
\item Suppose you have a random sample from a normal distribution, say $y_1, \ldots, y_n \stackrel{i.i.d.}{\sim} N(\mu,\sigma^2)$. If someone randomly sampled another observation from this population and asked you to guess what it was, there is no doubt you would say $\overline{y}$, and a confidence interval for $\mu$ is routine. But what if you were asked for a \emph{prediction} interval for a \emph{new} observation? Accordingly, suppose the normal model is reasonable and you observe a sample mean of $\overline{y} = 7.5$ and a sample variance (with $n-1$ in the denominator) of $s^2=3.82$. The sample size is $n=14$. Give a $95\%$ prediction interval for the next observation. The answer is a pair of numbers. Be able to show your work. You can get the distribution result you need from the formula sheet, or you can re-derive it for this special case. Be able to do it both ways. You should use R to get the critical value, but don't bother to bring your R printout for this question.

\item For a general multiple regression model with an intercept and $k$ independent variables, show that the squared sample correlation between the $y$ and $\widehat{y}$ values is equal to $R^2$. Thus, a scatterplot of $\widehat{y}_i$ versus $y_i$ gives a picture of the strength of the overall relationship between the independent variables and the dependent variable.

\item To diagnose problems with the model, it is standard practice to plot residuals against each independent variable in the model. Calculate the sample correlation between the $x_{ij}$ values (for a general $j$) and the $e_i$ values, assuming $\sum_{i=1}^n e_i = 0$. This is why you would expect the scatterplot to show nothing but a shapeless cloud of points if the model is correct.

\item A plot of the $\widehat{y}_i$ and $e_i$ values should also show nothing but a shapeless cloud.
\begin{enumerate}
\item Show that $\widehat{\mathbf{y}}$ and $\mathbf{e}$ are independent under normality. Did you need to assume that the residuals add to zero?
\item Show that if $\sum_{i=1}^n e_i = 0$, the sample correlation between the $\widehat{y}_i$ and $e_i$ values is exactly zero.
\end{enumerate}

\pagebreak
\item \label{curves} One of the built-in R datasets is \texttt{Puromycin}. Type the name to see it; there are only 23 lines of data. I believe the cases are test tubes; there were $n=23$ test tubes. The dependent variable is the rate of a chemical reaction, specifically an enzymatic reaction. The test tubes contain cells and also a \emph{substrate}, a reactant which is consumed during the enzymatic reaction. The independent variables are concentration of the substrate and whether or not the cells are treated with puromycin, an antibiotic. A rough sketch of the kind of R session you might run appears after this question.
\begin{enumerate}
\item Fit a model with just concentration and treatment with puromycin.
\begin{enumerate}
\item Controlling for concentration of the substrate, does treatment with puromycin have an effect? If so, what is the effect on the rate of the chemical reaction? Of course you should be able to state the null hypothesis, and also give the numerical value of the test statistic and so on.
\item Controlling for treatment with puromycin, does concentration of the substrate affect the rate of the chemical reaction? If so, does higher concentration speed up the reaction, or slow it down?
\item You would not expect a straight-line relationship between concentration and rate of a chemical reaction. Verify this with a residual plot.
\end{enumerate}
\item There are better ways to analyze this data set, but let's do a rough version using polynomial regression. Do the right thing and fit another model.
\begin{enumerate}
\item One of the default $t$-tests lets you verify that the relationship between concentration and rate is curvilinear. Which one? Give the value of the test statistic and the $p$-value.
\item How can you tell from the output of \texttt{summary} that the function is concave down?
\end{enumerate}
\item Plot the residuals from your second model against concentration. I think I see more curviness, maybe even with two bends. I see a possible outlier too, but let it go for now. Add a cubic term to your regression model. The output of \texttt{summary} has a test for whether the cubic term significantly improves model fit. What do you conclude? Is the cubic term helpful?
\item I think we should be fairly happy, because taken together, the polynomial terms improved the $R^2$ from 0.714 to 0.941. Plot the residuals again. Do you see that possible outlier?
\item Treating the Studentized deleted residuals as test statistics and employing a Bonferroni correction, test for possible outliers.
\begin{enumerate}
\item What is the Bonferroni critical value? This number should be on your printout. Be careful to get the degrees of freedom right.
\item Did you locate any outliers? For each one (if there are any), give the values of concentration and reaction rate (numbers), and state whether the cells were treated or untreated.
\end{enumerate}
\end{enumerate} % End last question
\textbf{Please bring your printout to the quiz.}

\end{enumerate}
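
For Problem~\ref{curves}, the following is only a sketch of the kind of R session you might run; the object names are arbitrary, the polynomial terms are coded with \texttt{I()} here (one of several ways to do it), and which fitted model you use for the outlier test is your own decision. The built-in data frame \texttt{Puromycin} contains the variables \texttt{conc}, \texttt{rate} and \texttt{state}.
\begin{verbatim}
# Sketch for Problem 7 (Puromycin)
mod1 <- lm(rate ~ conc + state, data = Puromycin)
summary(mod1)                          # t-tests for concentration and treatment
plot(Puromycin$conc, residuals(mod1))  # residual plot against concentration

mod2 <- lm(rate ~ conc + I(conc^2) + state, data = Puromycin)
summary(mod2)                          # sign and test of the quadratic term
plot(Puromycin$conc, residuals(mod2))

mod3 <- lm(rate ~ conc + I(conc^2) + I(conc^3) + state, data = Puromycin)
summary(mod3)                          # t-test for the cubic term
plot(Puromycin$conc, residuals(mod3))

# Studentized deleted residuals with a Bonferroni correction
# (use whichever model you settled on; mod3 is just an illustration)
tdel <- rstudent(mod3)
n    <- nrow(Puromycin)
crit <- qt(1 - 0.05/(2*n), df = mod3$df.residual - 1)  # deleted-residual df
crit
which(abs(tdel) > crit)
\end{verbatim}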
% \vspace{30mm}
\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline \end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f17}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f17}}

\end{document}

% Later