\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f16 Assignment Eight}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
Except for Problem~\ref{computer}, these problems are preparation for the quiz in tutorial on Thursday November 10th, and are not to be handed in. As usual, sometimes you may be asked to prove things that are false. Please bring your printout for Problem~\ref{computer} to the quiz. Do not write anything on the printout in advance of the quiz, except possibly your name and student number.

\begin{enumerate}

\item For the general linear regression model, assume that $n>k+1$ and that the columns of $X$ are linearly independent, so that $(X^\prime X)^{-1}$ exists and $\mathbf{b}$ is well defined. Starting from the definition on the formula sheet, prove that $\mathbf{e} = \mathbf{0}$.

\item Deleted
\item Deleted

%\newpage
\item \label{computer} The \texttt{statclass} data were used in Assignment~5. At the R prompt, type
{\scriptsize
\begin{verbatim}
statclass = read.table("http://www.utstat.utoronto.ca/~brunner/data/legal/LittleStatclassdata.txt")
\end{verbatim}
} % End size
You now have access to the \texttt{statclass} data. Fit a regression model in which the dependent variable is mark on the final exam, and the independent variables are Quiz Average, Computer Average, and mark on the Midterm test. Please use this variable order in your R program.
    \begin{enumerate}
    \item What is the predicted Final Exam score for a student with a Quiz average of 8.5, a Computer average of 5, and a Midterm mark of 60\%? The answer is a number. Be able to do this kind of thing on the quiz with a calculator from the output of \texttt{summary}.
    \item For any fixed Quiz Average and Computer Average, a score one point higher on the Midterm yields a predicted mark on the Final Exam that is \underline{\hspace{10mm}} higher.
    \item For any fixed Quiz Average and Midterm score, an average one point higher on the Computer Average yields a predicted mark on the Final Exam that is \underline{\hspace{10mm}} higher. Or is it lower?
    \item What is $b_3$? The answer is a number from your printout.
    \item For each of the following null hypotheses, give the value of the test statistic and the $p$-value. These are numbers from your printout. Also state whether you reject $H_0$ at $\alpha = 0.05$.
    \begin{center}
    \begin{tabular}{|c|c|c|c|} \hline
    $H_0$ & Test Statistic & $p$-value & Reject $H_0$? \\ \hline
    $\beta_1 = \beta_2 = \beta_3 = 0$ & & & \\ \hline
    $\beta_0 = 0$ & & & \\ \hline
    $\beta_1 = 0$ & & & \\ \hline
    $\beta_2 = 0$ & & & \\ \hline
    $\beta_3 = 0$ & & & \\ \hline
    \end{tabular}
    \end{center}
    \item For each of the following questions, give the null hypothesis you tested to answer the question, and also a conclusion expressed in plain, non-statistical language. Remember the rules: No statistical terminology, draw a directional conclusion if you can, be guided by $\alpha=0.05$ but never mention it, and don't accept $H_0$.
        \begin{enumerate}
        \item Controlling for quiz average and computer average, is mark on the midterm test related to mark on the final exam?
        \item Allowing for mark on the midterm test and quiz average, is computer average a useful predictor of mark on the final exam?
        \item Taking into account mark on the midterm test and computer average, is quiz average related to mark on the final exam?
        \item Are any of the predictor variables useful?
        \end{enumerate}
    \item Controlling for mark on the midterm test, are the other two variables (either or both) related to mark on the Final exam?
        \begin{enumerate}
        \item State the null hypothesis in terms of scalar $\beta$ values.
        \item State the null hypothesis in matrix terms. That is, give the matrices $C$, $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ in $H_0: C\boldsymbol{\beta} = \boldsymbol{\gamma}$.
        \item Write the reduced model. Please do not re-number the variables or $\beta_j$ parameters.
        \item Give the value of the test statistic $F$. It is a number from your printout, but not part of the \texttt{summary} output.
        \item Give the $p$-value. The answer is a number from your printout.
        \item Do you reject $H_0$ at $\alpha = 0.05$? Answer Yes or No.
        \item Are the results statistically significant at the $\alpha = 0.05$ level? Answer Yes or No.
        \item Allowing for mark on the midterm test, what proportion of the remaining variation in final exam score is explained by computer average and quiz average?
        \item State your conclusions (if any) in plain, non-statistical language.
        \end{enumerate}
    \item What proportion of the variation in final exam score is explained by the term work? The answer is a number from your printout.
    \item Controlling for Quiz average and Computer average, what proportion of the \emph{remaining} variation in Final Exam score is explained by score on the Midterm test? The answer is a number that you could obtain with a calculator from the output of \texttt{summary}.
    \item What is the largest $e_i$ in absolute value? The answer is on your printout.
    \item What is $k$ for this problem? You can get it from the output of \texttt{summary}.
    \item What is $n$ for this problem? You can calculate it from the output of \texttt{summary} without a calculator.
    \item What are the dimensions of the $\mathbf{X}$ matrix? The answer is a pair of numbers, number of rows and number of columns.
    You can calculate them from the output of \texttt{summary} without a calculator.
    \item What are the dimensions of $\mathbf{b}$? The answer is a pair of numbers, number of rows and number of columns. You can obtain them from the output of \texttt{summary} without a calculator.
    \item What are the dimensions of $\mathbf{e}$? The answer is a pair of numbers, number of rows and number of columns. You can obtain them from the output of \texttt{summary} without a calculator.
    \item What are the dimensions of $\mathbf{e}^\prime\mathbf{e}$?
    \item What are the dimensions of the $\widehat{\mathbf{y}}$ matrix? The answer is a pair of numbers, number of rows and number of columns.
    \item What are the dimensions of the hat matrix $H$? The answer is a pair of numbers, number of rows and number of columns.
    \item What is $\mathbf{e}^\prime\mathbf{e}$? You can calculate this number from the output of \texttt{summary} with a calculator, using the fact that \texttt{Residual standard error} from your printout is the square root of $s^2 = \mathbf{e}^\prime\mathbf{e}/(n-k-1)$.
    \item What is $SST$? The answer is a single number. You can check your work with R, but calculate the number based just on the output of \texttt{summary} and the formula sheet. First show your work (there is some algebra), and then obtain the result with a calculator. Circle your final answer.
    \item The tests and confidence intervals based on the $t$ distribution all use $t_{\alpha/2}$. By default we are using $\alpha = 0.05$, so $t_{\alpha/2}$ is the point cutting off the top 2.5\% of the $t$ distribution with $n-k-1$ degrees of freedom. Obtain this number with R and make sure it is included in your printout.
    \item With a calculator (or using R as a calculator) calculate a 95\% confidence interval for $\beta_3$. You can get the numbers you need from the output of \texttt{summary}. You don't need \texttt{vcov} for this one.
\item For this question, first use the \texttt{attach} function to make the variables conveniently available for calculation. See the \emph{Least squares with R} handout. Then calculate the means of all the independent variables. You might as well calculate $\overline{y}$ as well. \begin{enumerate} \item First, give a point estimate of $E(y|x_1 = \overline{x}_1, x_2 = \overline{x}_2, x_3 = \overline{x}_3)$. There is an easy way and a hard way. You decide: the easy way, the hard way, or both because you like to double-check everything. \item Give a 95\% confidence interval for $E(y|x_1 = \overline{x}_1, x_2 = \overline{x}_2, x_3 = \overline{x}_3)$. For this, you will need to use \texttt{vcov}. Your answer is a pair of numbers. You should do this with R and it should be on your printout. \end{enumerate} \end{enumerate} \item The U.S. Census Bureau divides the United States into small pieces called census tracts; lots of information is collected about each census tract. The census tracts are grouped into four geographic regions: Northeast, North Central, South and West. In one study, the cases were census tracts, the explanatory variables were Region and average income, and the response variable was crime rate, defined as the number of reported serious crimes in a census tract, divided by the number of people in the census tract. \begin{enumerate} \item Write $E(Y|x)$ for a regression model with parallel regression lines. You do not have to say how your dummy variables are defined. You will do that in the next part. \item Make a table showing how your dummy variables are set up. There should be one row for each region, and a column for each dummy variable. Add a wider column on the right, in which you show $E(Y|x)$. Note that the \emph{symbols} for your dummy variables will not appear in this column. There are examples of this format in the lecture slides. 
\item For each of the following questions, give the null hypothesis in terms of the $\beta$ parameters of your regression model. We are not doing one-tailed tests, regardless of how the question is phrased. \begin{enumerate} \item Controlling for average income, does average crime rate differ by geographic region? \item Allowing for average income, is average crime rate different in the Northeast and North Central regions? \item Controlling for average income, is average crime rate different in the Northeast and Western regions? \item Correcting for average income, is the crime rate in the South more than the average of the other three regions? \item Holding average income constant, is the average of the crime rates in the Northeast and North Central regions different from the average of the crime rates in the South and West? \item Controlling for geographic region, is crime rate connected to average income? \end{enumerate} \end{enumerate} \end{enumerate} \vspace{30mm} \noindent \begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline \end{tabular}\end{center} This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/302f16} {\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f16}} \end{document} % Later \item Based on the general linear model with normal error terms, \begin{enumerate} \item Prove the $t$ distribution given on the formula sheet for a new observation $y_0$. Use earlier material on the formula sheet. For example, how do you know numerator and denominator are independent? 
\item Derive the $(1-\alpha)\times 100\%$ prediction interval for a new observation from this population, in which the independent variable values are given in $\mathbf{x}_0$. ``Derive'' means show the High School algebra.
\end{enumerate}

\item Suppose you have a random sample from a normal distribution, say $y_1, \ldots, y_n \stackrel{i.i.d.}{\sim} N(\mu,\sigma^2)$. If someone randomly sampled another observation from this population and asked you to guess what it was, there is no doubt you would say $\overline{y}$, and a confidence interval for $\mu$ is routine. But what if you were asked for a \emph{prediction} interval for a \emph{new} observation? Accordingly, suppose the normal model is reasonable and you observe a sample mean of $\overline{y} = 7.5$ and a sample variance (with $n-1$ in the denominator) of $s^2=3.82$. The sample size is $n=14$. Give a $95\%$ prediction interval for the next observation. The answer is a pair of numbers. Be able to show your work. You can get the distribution result you need from the formula sheet, or you can re-derive it for this special case. Be able to do it both ways. You should use R to get the critical value, but don't bother to bring your R printout for this question.