\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f15 Assignment Nine}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
Problems~\ref{corr} and~\ref{twosample} are paper and pencil. They are preparation for the quiz in tutorial on Thursday, November 12th, and are not to be handed in. Problem~\ref{computer} uses R. Please bring your printout for Problem~\ref{computer} to the quiz. Do not write anything on the printout in advance of the quiz, except possibly your name and student number.

\begin{enumerate}

\item \label{corr} The simple linear regression model is $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ for $i=1, \ldots, n$, where $\epsilon_1, \ldots, \epsilon_n$ are a random sample from a distribution with expected value zero and variance $\sigma^2$. The numbers $x_1, \ldots, x_n$ are known, observed constants, while $\beta_0$, $\beta_1$ and $\sigma^2$ are unknown constants (parameters). In a previous homework (Assignment~4), you obtained
\begin{displaymath}
\widehat{\beta}_0 = \overline{y} - \widehat{\beta}_1 \overline{x}
\mbox{ ~~~~~~and~~~~~~ }
\widehat{\beta}_1 = \frac{\sum_{i=1}^n(x_i-\overline{x})(y_i-\overline{y})}
                         {\sum_{i=1}^n(x_i-\overline{x})^2}
\end{displaymath}
Show that for this model, $R^2$ is the square of the correlation coefficient
\begin{displaymath}
r = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i-\overline{y})}
         {\sqrt{\sum_{i=1}^n (x_i-\overline{x})^2} \sqrt{\sum_{i=1}^n (y_i-\overline{y})^2}} .
\end{displaymath}
You may use anything on the formula sheet.
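
A numerical check is no substitute for the proof, but it can catch an algebra mistake. The following is a minimal sketch in R using made-up data; the seed, the sample size and the parameter values are arbitrary assumptions, not part of the assignment. The two numbers printed at the end should agree.
\begin{verbatim}
set.seed(302)             # Arbitrary seed (an assumption), for reproducibility
n = 50                    # Arbitrary sample size
x = rnorm(n)              # Made-up values of the explanatory variable
y = 1 + 2*x + rnorm(n)    # Made-up beta0 = 1, beta1 = 2, sigma = 1
fit = lm(y ~ x)           # Simple linear regression
summary(fit)$r.squared    # R-squared from the regression
cor(x,y)^2                # Square of the sample correlation coefficient
\end{verbatim}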

\item \label{twosample} This is a good test of whether you understand how $t$ statistics are constructed. You may use the fact (a fact you have proved) that for a normal random sample,
\begin{displaymath}
\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1).
\end{displaymath}
Let $x_1, \dots, x_{n_1} \stackrel{i.i.d.}{\sim} N(\mu_1,\sigma^2)$, and $y_1, \dots, y_{n_2} \stackrel{i.i.d.}{\sim} N(\mu_2,\sigma^2)$. These two random samples are independent, meaning all the $x$ variables are independent of all of the $y$ variables. Every elementary Statistics text tells you that
\begin{displaymath}
T = \frac{\overline{x}-\overline{y} - (\mu_1-\mu_2)}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t(n_1+n_2-2),
\end{displaymath}
where
\begin{displaymath}
S^2_p = \frac{\sum_{i=1}^{n_1}(x_i-\overline{x})^2 + \sum_{i=1}^{n_2}(y_i-\overline{y})^2}
             {n_1+n_2-2} .
\end{displaymath}
This is the basis of tests and confidence intervals for $\mu_1-\mu_2$.
\begin{enumerate}
\item Prove that $T$ does indeed have the distribution claimed. Carefully cite material from the formula sheet when you use it. The word ``independent'' should appear in your answer at least \emph{twice}.
\item Suppose you wanted to test $H_0:\mu_1=\mu_2$. Give a formula for the test statistic.
\item Derive a $(1-\alpha)100\%$ confidence interval for $\mu_1-\mu_2$. ``Derive'' means show all the High School algebra.
\end{enumerate}

\newpage

\item \label{computer} The \texttt{statclass} data were used in Assignment~6. At the R prompt, type
{\scriptsize
\begin{verbatim}
statclass = read.table("http://www.utstat.utoronto.ca/~brunner/data/legal/LittleStatclassdata.txt")
\end{verbatim}
} % End size
You now have access to the \texttt{statclass} data. Fit a regression model in which the dependent variable is mark on the final exam, and the independent variables are Quiz Average, Computer Average, and mark on the Midterm test.
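
The fit might look something like the sketch below. The variable names \texttt{FinalExam}, \texttt{QuizAve}, \texttt{CompAve} and \texttt{MidTerm}, and the object name \texttt{fit}, are assumptions made for illustration; look at the output of \texttt{names(statclass)} and substitute the names that are really in the data frame. The independent variables are given in the same order as in the problem statement.
{\scriptsize
\begin{verbatim}
names(statclass)   # See what the variables are actually called
head(statclass)    # Look at the first few lines of data
# The variable names below are hypothetical; replace them with the real ones.
fit = lm(FinalExam ~ QuizAve + CompAve + MidTerm, data = statclass)
summary(fit)       # Estimates, standard errors, t and F tests, R-squared
\end{verbatim}
} % End size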
\begin{enumerate}
\item What is the predicted Final Exam score for a student with a Quiz average of 8.5, a Computer average of 5, and a Midterm mark of 60\%? The answer is a number. Be able to do this kind of thing on the quiz with a calculator from the output of \texttt{summary}.
\item For any fixed Quiz Average and Computer Average, a score one point higher on the Midterm yields a predicted mark on the Final Exam that is \underline{\hspace{10mm}} higher.
\item For any fixed Quiz Average and Midterm score, an average one point higher on the Computer Average yields a predicted mark on the Final Exam that is \underline{\hspace{10mm}} higher. Or is it lower?
\item What is $\widehat{\beta}_3$? The answer is a number from your printout.
\item For each of the following null hypotheses, give the value of the test statistic and the $p$-value. These are numbers from your printout. Also state whether you reject $H_0$ at $\alpha = 0.05$.
\begin{center}
\begin{tabular}{|c|c|c|c|} \hline
$H_0$ & Test Statistic & $p$-value & Reject $H_0$? \\ \hline
$\beta_1 = \beta_2 = \beta_3 = 0$ & & & \\ \hline
$\beta_0 = 0$ & & & \\ \hline
$\beta_1 = 0$ & & & \\ \hline
$\beta_2 = 0$ & & & \\ \hline
$\beta_3 = 0$ & & & \\ \hline
\end{tabular}
\end{center}
\item For each of the following questions, give the null hypothesis you tested to answer the question, and also a conclusion expressed in plain, non-statistical language. Remember the rules: No statistical terminology, draw a directional conclusion if you can, be guided by $\alpha=0.05$ but never mention it, and don't accept $H_0$.
\begin{enumerate}
\item Controlling for quiz average and computer average, is mark on the midterm test related to mark on the final exam?
\item Allowing for mark on the midterm test and quiz average, is computer average a useful predictor of mark on the final exam?
\item Taking into account mark on the midterm test and computer average, is quiz average related to mark on the final exam?
\item Are any of the predictor variables useful?
\end{enumerate}
\item What proportion of the variation in final exam score is explained by the term work? The answer is a number from your printout.
\item What is the largest $\widehat{\epsilon}_i$ in absolute value? The answer is on your printout.
\item My printout has ``\texttt{Residual standard error: 14.54}.'' What is this number?
\item What is MSE? The answer is a number you can get with a calculator from your output.
\item What is $k$ for this problem? You can get it from the output of \texttt{summary}.
\item What is $n$ for this problem? You can calculate it from the output of \texttt{summary} without a calculator.
\item What are the dimensions of the $\mathbf{X}$ matrix? The answer is a pair of numbers. You can calculate them from the output of \texttt{summary} without a calculator.
\item What are the dimensions of the $\widehat{\boldsymbol{\beta}}$ matrix? The answer is a pair of numbers. You can obtain them from the output of \texttt{summary} without a calculator.
\item What are the dimensions of the $\widehat{\boldsymbol{\epsilon}}$ matrix? The answer is a pair of numbers. You can get them from the output of \texttt{summary} without a calculator.
\item What are the dimensions of $\widehat{\boldsymbol{\epsilon}}^\prime \, \widehat{\boldsymbol{\epsilon}}$?
\item What are the dimensions of the $\widehat{\mathbf{y}}$ matrix? The answer is a pair of numbers.
\item What are the dimensions of the hat matrix $\mathbf{H}$? The answer is a pair of numbers.
\item What is $\widehat{\boldsymbol{\epsilon}}^\prime \, \widehat{\boldsymbol{\epsilon}}$? You can calculate this number from the output of \texttt{summary} using a calculator, if you know what \texttt{Residual standard error} is.
\item What is $SST$? The answer is a single number. You can check your work with R, but calculate the number based just on the output of \texttt{summary} and the formula sheet.
First show your work (there is some algebra), and then obtain the result with a calculator. Circle your final answer.
\item The tests and confidence intervals based on the $t$ distribution all use $t_{\alpha/2}$. By default we are using $\alpha = 0.05$, so $t_{\alpha/2}$ is the point cutting off the top 2.5\% of the $t$ distribution with $n-k-1$ degrees of freedom. Obtain this number with R and make sure it is included in your printout.
\item With a calculator (or using R as a calculator), calculate a 95\% confidence interval for $\beta_3$. You can get the numbers you need from the output of \texttt{summary}. You don't need \texttt{vcov} for this one. You might want to refer to your answer to Question 6h from Assignment~8.
\item For this question, first use the \texttt{attach} function to make the variables conveniently available for calculation. See the \emph{Least squares with R} handout. Then calculate the means of all the independent variables. You might as well calculate $\overline{y}$ too.
\begin{enumerate}
\item First, give a point estimate of $E(y|x_1 = \overline{x}_1, x_2 = \overline{x}_2, x_3 = \overline{x}_3)$. There is an easy way and a hard way. You decide: the easy way, the hard way, or both because you like to double-check everything.
\item Give a 95\% confidence interval for $E(y|x_1 = \overline{x}_1, x_2 = \overline{x}_2, x_3 = \overline{x}_3)$. For this, you will need to use \texttt{vcov}. Again, refer to your answer to Question 6h from Assignment~8. Your answer is a pair of numbers. You should do this with R and it should be on your printout.
\end{enumerate}
\end{enumerate}
\end{enumerate}
% \vspace{70mm}

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline \end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto.
It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/302f15} {\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f15}} \end{document}