\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442 Assignment Five}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent The questions are just practice for the quiz, and are not to be handed in. Use R for Questions~\ref{samplesize} and \ref{sat}, and bring two separate printouts to the quiz. \textbf{Your printouts should show \emph{all} R input and output, and \emph{only} R input and output}. Do not write anything on your printouts except your name and student number.
\vspace{1mm}

\begin{enumerate}

\item \label{normalsample} Let $Y_1, \ldots, Y_n$ be a random sample from a $N(\mu,\sigma^2)$ distribution. The sample variance is $S^2 = \frac{\sum_{i=1}^n\left(Y_i-\overline{Y} \right)^2 }{n-1}$.
\begin{enumerate}
\item Show $Cov(\overline{Y},(Y_j-\overline{Y}))=0$ for any $j=1, \ldots, n$.
\item How do you know that $\overline{Y}$ and $S^2$ are independent?
\item Show that
\begin{displaymath}
\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1).
\end{displaymath}
Hint: $\sum_{i=1}^n\left(Y_i-\mu \right)^2 = \sum_{i=1}^n\left(Y_i-\overline{Y} + \overline{Y} - \mu \right)^2 = \ldots$
\end{enumerate}

\item Recall the definition of the $t$ distribution. If $Z \sim N(0,1)$, $W \sim \chi^2(\nu)$ and $Z$ and $W$ are independent, then $T = \frac{Z}{\sqrt{W/\nu}}$ is said to have a $t$ distribution with $\nu$ degrees of freedom, and we write $T \sim t(\nu)$. As in the last question, let $Y_1, \ldots, Y_n$ be a random sample from a $N(\mu,\sigma^2)$ distribution. Show that $T = \frac{\sqrt{n}(\overline{Y}-\mu)}{S} \sim t(n-1)$.
The key is to locate a $Z$ for the numerator and a $W$ for the denominator. \item For the general fixed effects linear regression model in matrix form (see formula sheet), show that the $n \times p$ matrix of covariances $C(\mathbf{e},\widehat{\boldsymbol{\beta}}) = \mathbf{0}$. Why does this establish that $SSE$ and $\widehat{\boldsymbol{\beta}}$ are independent? \item Last week you showed that $(\mathbf{y}-X\boldsymbol{\beta})^\top (\mathbf{y}-X\boldsymbol{\beta}) = \mathbf{e}^\top\mathbf{e} + (\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})^\top (X^\top X) (\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})$. Dividing both sides by $\sigma^2$, show that $\mathbf{e}^\top\mathbf{e}/\sigma^2 \sim \chi^2(n-p)$. Start with the distribution of the left side. \item Tests and confidence intervals for linear combinations of regression coefficients are very useful. Derive the appropriate $t$ distribution and some applications by following these steps. Let $\mathbf{a}$ be a $p \times 1$ vector of constants. \begin{enumerate} \item What is the distribution of $\mathbf{a}^\top \widehat{\boldsymbol{\beta}}$? Show a little work. Your answer includes both the expected value and the variance. \item Now standardize the difference (subtract off the mean and divide by the standard deviation) to obtain a standard normal. \item Divide by the square root of a well-chosen chi-squared random variable, divided by its degrees of freedom, and simplify. Call the result $T$. \item How do you know numerator and denominator are independent? \item Suppose you wanted to test $H_0: \mathbf{a}^\top\boldsymbol{\beta} = c$. Write down a formula for the test statistic. \item Suppose you wanted to test $H_0: \beta_2=0$. Give the vector $\mathbf{a}$. \item Suppose you wanted to test $H_0: \beta_1=\beta_2$. Give the vector $\mathbf{a}$. 
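As a check on the steps above: writing $MSE = \mathbf{e}^\top\mathbf{e}/(n-p)$ (your formula sheet's notation may differ), the statistic you assemble should reduce to the standard form
\begin{displaymath}
T = \frac{\mathbf{a}^\top\widehat{\boldsymbol{\beta}} - \mathbf{a}^\top\boldsymbol{\beta}}
{\sqrt{MSE \, \mathbf{a}^\top (X^\top X)^{-1}\mathbf{a}}} \sim t(n-p).
\end{displaymath}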
\item Letting $t_{\alpha/2}$ denote the point cutting off the top $\alpha/2$ of the $t$ distribution with $n-p$ degrees of freedom, give a $(1-\alpha) \times 100\%$ confidence interval for $\mathbf{a}^\top\boldsymbol{\beta}$. \end{enumerate} \item In this question you will develop a \emph{prediction interval} (not a confidence interval) for $Y_{n+1}$. \begin{enumerate} \item What is the distribution of $Y_{n+1}-\widehat{Y}_{n+1} = Y_{n+1}-\mathbf{x}_{n+1}^\top \widehat{\boldsymbol{\beta}}$? Show your work. Your answer includes both the expected value and the variance. \item Now standardize the difference to obtain a standard normal. \item Divide by the square root of a chi-squared random variable, divided by its degrees of freedom, and simplify. Call it $T$. Compare your answer to a slide from lecture. How do you know that numerator and denominator are independent? \item Using your result, derive the $(1-\alpha)\times 100\%$ prediction interval for $Y_{n+1}$. \end{enumerate} \item In this question you will establish the $F$ distribution for the general linear test. For the general linear model (see formula sheet), \begin{enumerate} \item What is the distribution of $\mathbf{L}\widehat{\boldsymbol{\beta}}$? Note $\mathbf{L}$ is $r \times p$. \item If $H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{h}$ is true, what is the distribution of $\frac{1}{\sigma^2}(\mathbf{L}\widehat{\boldsymbol{\beta}}-\mathbf{h})^\top (\mathbf{L}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{L}^\top)^{-1} (\mathbf{L}\widehat{\boldsymbol{\beta}}-\mathbf{h})$? Please locate support for your answer on the formula sheet. For full marks, don't forget the degrees of freedom. \item What other facts on the formula sheet allow you to establish the $F$ distribution for the general linear test? The distribution is \emph{given} on the formula sheet, so of course you can't use that. How do you know numerator and denominator are independent? 
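For reference, the statistic you are building toward here is the usual general linear test statistic. With $MSE = \mathbf{e}^\top\mathbf{e}/(n-p)$ (notation may differ slightly from your formula sheet), it can be written
\begin{displaymath}
F = \frac{(\mathbf{L}\widehat{\boldsymbol{\beta}}-\mathbf{h})^\top \left(\mathbf{L}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{L}^\top\right)^{-1} (\mathbf{L}\widehat{\boldsymbol{\beta}}-\mathbf{h})}{r \, MSE},
\end{displaymath}
which has an $F(r,n-p)$ distribution when $H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{h}$ is true.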
\end{enumerate}

\item Suppose you wish to test the null hypothesis that a \emph{single} linear combination of regression coefficients is equal to zero. That is, you want to test $H_0: \mathbf{a}^\top\boldsymbol{\beta} = 0$. Referring to earlier questions or the formula sheet, verify that $F=T^2$. Show your work.

\pagebreak

\item For the general linear regression model with normal error terms, show that if the model has an intercept, $\mathbf{e}$ and $\overline{y}$ are independent. Here are some ingredients to start you out. For the model with intercept,
\begin{enumerate}
\item What does $X^\top\mathbf{e} = \mathbf{0}$ tell you about $\sum_{i=1}^n e_i$?
\item Therefore what do you know about $\sum_{i=1}^n y_i$ and $\sum_{i=1}^n \widehat{y}_i$?
\item Show that the least squares plane must pass through the point $(\overline{x}_1, \ldots, \overline{x}_{p-1}, \overline{y})$.
\item Now show that $\mathbf{e}$ and $\overline{y}$ are independent.
\end{enumerate}

\item Continue assuming that the regression model has an intercept. Many statistical programs automatically provide an \emph{overall} test that says none of the independent variables makes any difference. If you can't reject that, you're in trouble. If $H_0: \beta_1 = \cdots = \beta_{p-1} = 0$ is true,
\begin{enumerate}
\item What is the distribution of $Y_i$?
\item What is the distribution of $\frac{SST}{\sigma^2}$? Just write down the answer. Check Problem~\ref{normalsample}.
\end{enumerate}

\item Still assuming $H_0: \beta_1 = \cdots = \beta_{p-1} = 0$ is true, find the distribution of $SSR/\sigma^2$. Use the formula sheet and show your work.

\item \label{Fstat} Recall the definition of the $F$ distribution. If $W_1 \sim \chi^2(\nu_1)$ and $W_2 \sim \chi^2(\nu_2)$ are independent, $F = \frac{W_1/\nu_1}{W_2/\nu_2} \sim F(\nu_1,\nu_2)$. Show that $F = \frac{SSR/(p-1)}{SSE/(n-p)}$ has an $F$ distribution under $H_0: \beta_1 = \cdots = \beta_{p-1} = 0$. Refer to the results of questions above as you use them.
\item The null hypothesis $H_0: \beta_1 = \cdots = \beta_{p-1} = 0$ is less and less believable as $R^2$ becomes larger. Show that the $F$ statistic of Question~\ref{Fstat} is an increasing function of $R^2$ for fixed $n$ and $p$. This means it makes sense to reject $H_0$ for large values of $F$.

\item When you fit a full and a reduced regression model, the proportion of remaining variation explained by the additional variables in the full model is $a = \frac{R^2_F-R^2_R}{1-R^2_R}$.
\begin{enumerate}
\item Show
\begin{displaymath}
F = \frac{(SSR_F-SSR_R)/r}{MSE_F} = \left(\frac{n-p}{r}\right) \left(\frac{a}{1-a}\right).
\end{displaymath}
\item Show $a = \frac{rF}{n-p+rF}$. This means that you can calculate the proportion of remaining variation for any $F$ or $t$-test without explicitly fitting a reduced model. All you need is a calculator.
\end{enumerate}

\item \label{samplesize} For a regression model with nine explanatory variables, you want the test of $H_0: \beta_1=\beta_2=\beta_3=\beta_4=0$ to be statistically significant at the $\alpha=0.05$ level provided that the variables $x_1$ through $x_4$ explain at least 3\% of the remaining variation. What sample size is required? The answer is a number from your R printout.

\item In the usual univariate multiple regression model, the matrix $\mathbf{X}$ is an $n \times p$ matrix of known constants. But of course in practice, the explanatory variables are random, not fixed. Clearly, if the model holds \emph{conditionally} upon the values of the explanatory variables, then all the usual results hold, again conditionally upon the particular values of the explanatory variables. The probabilities (for example, $p$-values) are conditional probabilities, and the $F$ statistic does not have an $F$ distribution, but a conditional $F$ distribution, given $\mathbf{X=x}$.
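To illustrate the claim that $a = \frac{rF}{n-p+rF}$ needs only a calculator, here is a quick R check with made-up values (these numbers are hypothetical, not from any data set in this course):
{\footnotesize
\begin{verbatim}
r = 2; Fstat = 6; n = 100; p = 5   # Hypothetical values, not from the assignment
a = r*Fstat / (n - p + r*Fstat)    # Proportion of remaining variation
a                                  # Prints 0.1121495
\end{verbatim}
} % End size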
\begin{enumerate}
\item Show that the least-squares estimator $\widehat{\boldsymbol{\beta}}= (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$ is conditionally unbiased.
\item Show that $\widehat{\boldsymbol{\beta}}$ is also unbiased unconditionally. Use double expectation.
\item A similar calculation applies to the significance level of a hypothesis test. Let $F$ be the test statistic (say for an $F$-test comparing full and reduced models), and $f_c$ be the critical value. If the null hypothesis is true, then the test is size $\alpha$, conditionally upon the explanatory variable values. That is, $P(F>f_c|\mathbf{X=x})=\alpha$. Find the \emph{unconditional} probability of a Type I error. Assume that the explanatory variables are discrete, so you can write a multiple sum.
\end{enumerate}

\item \label{sat} For this question, you will use the \href{http://www.utstat.toronto.edu/~brunner/data/legal/openSAT.data.txt}{\texttt{sat.data}} again. Get the data with
{\footnotesize
\begin{verbatim}
sat = read.table("http://www.utstat.toronto.edu/~brunner/data/legal/openSAT.data.txt")
\end{verbatim}
} % End size
We seek to predict GPA from the two test scores. Throughout, please use the usual $\alpha=0.05$ significance level.
\begin{enumerate}
\item First, fit a model using just the Math score as a predictor. ``Fit'' means estimate the model parameters. Does there appear to be a relationship between Math score and grade point average?
\begin{enumerate}
\item Answer Yes or No.
\item Fill in the blank. Students who did better on the Math test tended to have \underline{~~~~~~~~~~~} first-year grade point average.
\item Do you reject $H_0: \beta_1=0$?
\item Are the results statistically significant? Answer Yes or No.
\item What is the $p$-value? The answer can be found in \emph{two} places on your printout.
\item What proportion of the variation in first-year grade point average is explained by score on the SAT Math test? The answer is a number from your printout.
\item Give a predicted first-year grade point average and a 95\% prediction interval for a student who got 700 on the Math SAT. \end{enumerate} \newpage \item Now fit a model with both the Math and Verbal sub-tests. \begin{enumerate} \item Give the test statistic, the degrees of freedom and the $p$-value for each of the following null hypotheses. The answers are numbers from your printout. \begin{enumerate} \item $H_0: \beta_1=\beta_2=0$ \item $H_0: \beta_1=0$ \item $H_0: \beta_2=0$ \item $H_0: \beta_0=0$ \end{enumerate} \item Controlling for Math score, is Verbal score related to first-year grade point average? \begin{enumerate} \item Give the null hypothesis in symbols. \item Give the value of the test statistic. The answer is a number from your printout. \item Give the $p$-value. The answer is a number from your printout. \item Do you reject the null hypothesis? \item Are the results statistically significant? Answer Yes or No. \item In plain, non-statistical language, what do you conclude? The answer is something about test scores and grade point average. \end{enumerate} \item Controlling for Verbal score, is Math score related to first-year grade point average? \begin{enumerate} \item Give the null hypothesis in symbols. \item Give the value of the test statistic. The answer is a number from your printout. \item Give the $p$-value. The answer is a number from your printout. \item Do you reject the null hypothesis? \item Are the results statistically significant? Answer Yes or No. \item In plain, non-statistical language, what do you conclude? The answer is something about test scores and grade point average. \end{enumerate} \item Math score explains \underline{~~~~~~~} percent of the remaining variation in grade point average once you take Verbal score into account. Using the formula from the slides (see formula sheet), you should be able to calculate this from the output of the \texttt{summary} function. 
You can check your answer using the \texttt{anova} function.

\newpage

\item Give a predicted first-year grade point average and a 95\% prediction interval for a student who got 650 on the Verbal and 700 on the Math SAT. Are you confident that this student's first-year GPA will be above 2.0 (a C average)?

\item Let's do one more test. We want to know whether expected GPA increases faster as a function of the Verbal SAT, or the Math SAT. That is, we want to compare the regression coefficients, testing $H_0: \beta_1=\beta_2$.
\begin{enumerate}
\item Express the null hypothesis in matrix form as $\mathbf{L}\boldsymbol{\beta} = \mathbf{h}$.
\item Carry out a two-sided $t$-test.
\item Carry out an $F$ test, the easy way. Does $F=t^2$?
\item State your conclusion in plain, non-technical language. It's something about first-year grade point average.
% Can't conclude that expected GPA increases at different rates as a function of Verbal SAT and Math SAT.
\end{enumerate}
\end{enumerate}
Bring your printout to the quiz.
\end{enumerate}
\end{enumerate}

\vspace{50mm}
\noindent
\begin{center}\begin{tabular}{l}
\hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely.
The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf16} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf16}} \end{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%