\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f13 Assignment Eight}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent This assignment assumes you are using the new
\href{http://www.utstat.toronto.edu/~brunner/302f13/formulas/302f13Formulas1.pdf}
{Formula sheet}. There is a link on the course home page in case the one in this document does not work. The formula sheet (or part of it) will be provided with the quiz.

\begin{enumerate}

\item \label{moresat} In an extended version of the SAT data, the independent variables are
\begin{itemize}
\item[$x_1=$] Verbal SAT score
\item[$x_2=$] Math SAT score
\item[$x_3=$] High school Grade Point Average
\item[$x_4=$] Mother's education, in years
\item[$x_5=$] Father's education, in years
\item[$x_6=$] Total family income
\end{itemize}
The dependent variable is first-year university Grade Point Average (GPA) again. For each of the following questions, give the null hypothesis in the form of a statement about the $\beta$ values, and then give the $\mathbf{C}$ and $\mathbf{t}$ matrices in $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$.
\begin{enumerate}
\item Controlling for all other variables, is either Verbal SAT score or Math SAT score (or both) related to GPA?
\item When you allow for all the other variables, is family income a useful predictor of GPA?
\item Controlling for all other variables, does expected GPA change faster as a function of Verbal SAT, or does it change faster as a function of Math SAT?
\item Once you correct for the two SAT scores and High School marks, do any of the family variables matter?
\item Correcting for all other variables, does expected GPA change faster as a function of Mother's education, or does it change faster as a function of Father's education?
\item Holding all the other variables constant at fixed values, is Math SAT related to first-year university GPA?
\end{enumerate}

\item For each part of Question~\ref{moresat}, give $E(Y)$ for the reduced model, and give $E(Y)$ for the full model.

\item For the general linear model (see formula sheet),
\begin{enumerate}
\item What is the distribution of $\mathbf{C}\widehat{\boldsymbol{\beta}}$? Note $\mathbf{C}$ is $q \times (k+1)$.
\item If $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$ is true, what is the distribution of $\frac{1}{\sigma^2}(\mathbf{C}\widehat{\boldsymbol{\beta}}-\mathbf{t})^\prime (\mathbf{C}(\mathbf{X}^\prime \mathbf{X})^{-1}\mathbf{C}^\prime)^{-1} (\mathbf{C}\widehat{\boldsymbol{\beta}}-\mathbf{t})$?
\item What other facts on the formula sheet allow you to establish the $F$ distribution for the general linear test? The distribution is \emph{given} on the formula sheet, so of course you can't use that. In particular, how do you know the numerator and denominator are independent?
\end{enumerate}

\item Suppose you need to test the null hypothesis that a \emph{single} linear combination of regression coefficients is equal to zero. That is, you want to test $H_0: \mathbf{a}^\prime\boldsymbol{\beta} = 0$. Referring to the formula sheet, verify that $F=T^2$. Show your work.

\item Starting from the formula sheet, show that the $F$ test for comparing full and reduced models may be written
\begin{displaymath}
F = \left( \frac{a}{1-a} \right) \left( \frac{n-k-1}{q} \right),
\end{displaymath}
where $a = \frac{R^2-R^2_r}{1-R^2_r}$. Show your work. You may use $SST=SSR+SSE$ and $R^2 = \frac{SSR}{SST}$, which are not on the formula sheet (yet).
\item The quantity denoted by $a$ in the last question has a useful interpretation. It is the proportion of \emph{remaining} variation in the dependent variable that is explained when the independent variables in the second set are added to the model. That is, the variables in the reduced model explain $R^2_r$, so they fail to explain $1-R^2_r$. Then the variables in the second set are added to the reduced model, yielding the full model, and $R^2$ goes up. The quantity $a$ expresses this improvement as a proportion of the improvement that was possible. Derive a formula for $a$, writing $a$ in terms of $F$, $n$, $k$ and $q$. Show your work. This formula can give an idea of how strong a set of results is, when all you are given is an $F$ or $t$ statistic and the degrees of freedom.

\item This question uses the data file
\href{http://www.utstat.toronto.edu/~brunner/302f13/code_n_data/hw/CensusTract.data}
{\texttt{CensusTract.data}} from Assignment 7. There is a link on the course home page in case the one in this document does not work. Start with the model in which the dependent variable is crime rate, and the independent variables are \texttt{area}, \texttt{urban}, \texttt{old}, \texttt{docs}, \texttt{beds}, \texttt{hs}, \texttt{labor} and \texttt{income}.
\begin{enumerate}
\item According to the $t$-tests, the variables \texttt{old}, \texttt{labor} and \texttt{income} don't appear to be doing much. Test them simultaneously, the easiest way you can. Your R printout will include an $F$ statistic, degrees of freedom and $p$-value. What do you conclude? Is there a case for dropping these variables from the model?
\item Do an $F$-test for percent of high school graduates, controlling for all other variables. Again, do it the easiest way you can. Compare the $p$-value to that of the $t$-test. Does $F=T^2$? Are the test statistics (the specific numbers) equally informative? If not, which one tells you more?
\end{enumerate}

\end{enumerate}

\vspace{20mm}

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f13}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f13}}

\end{document}

% This is the only question that does not require R. You may use the fact

``Census Tract data,"

\vspace{5mm}
\noindent
These problems are preparation for the quiz in tutorial on Friday November 1st, and are not to be handed in.

# R work
rm(list=ls())
census = read.table("http://www.utstat.toronto.edu/~brunner/302f13/code_n_data/hw/CensusTract.data")
attach(census) # Make the columns (crimes, pop, area, ...) visible by name
crimerate = crimes/pop
fullmod = lm(crimerate ~ area + urban + old + docs + beds + hs + labor + income)
summary(fullmod)