% 302f20Assignment5.tex
\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage{comment}
\usepackage{euscript} % for \EuScript
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f20 Assignment Five}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/302f20} {\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f20}}} }
\vspace{1 mm}
\end{center}

\noindent The following problems are not to be handed in. They are preparation for the quiz in tutorial and the final exam. Please try them before looking at the answers. Use the formula sheet. Please remember that the R parts (Questions~\ref{sat} and~\ref{faraway}) are \emph{not group projects}. You may compare numerical answers, but do not show anyone your code or look at anyone else's.

\begin{enumerate}
\item For the general linear regression model in matrix form, find $E(\mathbf{y})$ and $cov(\mathbf{y})$. Show your work.

\item What are the dimensions of the random vector $\widehat{\boldsymbol{\beta}}$? Give the number of rows and the number of columns.

\item Is $\widehat{\boldsymbol{\beta}}$ an unbiased estimator of $\boldsymbol{\beta}$? Answer Yes or No and show your work.
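As a reminder for the matrix calculations in this and the following questions (these are standard rules, which should match the formula sheet): if $\mathbf{A}$ is a matrix of constants, $\mathbf{b}$ is a vector of constants and $\mathbf{y}$ is a random vector, then
\begin{displaymath}
E(\mathbf{Ay} + \mathbf{b}) = \mathbf{A}E(\mathbf{y}) + \mathbf{b}
\qquad \mbox{and} \qquad
cov(\mathbf{Ay} + \mathbf{b}) = \mathbf{A} \, cov(\mathbf{y}) \, \mathbf{A}^\prime.
\end{displaymath}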
\item Calculate $cov(\widehat{\boldsymbol{\beta}})$ and simplify. Show your work.

\item What are the dimensions of the random vector $\widehat{\mathbf{y}}$?

\item What is $E(\widehat{\mathbf{y}})$? Show your work.

\item What is $cov(\widehat{\mathbf{y}})$? Show your work.

\item What are the dimensions of the random vector $\widehat{\boldsymbol{\epsilon}}$?

\item What is $E(\widehat{\boldsymbol{\epsilon}})$? Show your work. Is $\widehat{\boldsymbol{\epsilon}}$ an unbiased estimator of $\boldsymbol{\epsilon}$? This is a trick question, and requires thought.

\item What is $cov(\widehat{\boldsymbol{\epsilon}})$? Show your work. It is easier if you use $\mathbf{I}-\mathbf{H}$.

\item \label{nox} This is the simplest case of the Gauss-Markov Theorem. Let $Y_1, \ldots, Y_n$ be independent random variables with $E(Y_i)=\mu$ and $Var(Y_i)=\sigma^2$ for $i=1, \ldots, n$.
\begin{enumerate}
\item Write down $E(\overline{Y})$ and $Var(\overline{Y})$.
\item Let $c_1, \ldots, c_n$ be constants and define the linear combination $L$ by $L = \sum_{i=1}^n c_i Y_i$. Recall that $L$ unbiased for $\mu$ means that $E(L)=\mu$ for \emph{all} real $\mu$. Show that $L$ unbiased for $\mu$ implies $\sum_{i=1}^n c_i = 1$.
\item Is $\overline{Y}$ a special case of $L$? If so, what are the $c_i$ values?
\item What is $Var(L)$?
\item Now show that $Var(\overline{Y}) \leq Var(L)$ for every unbiased $L$, with equality when $L = \overline{Y}$. Hint: $\sum_{i=1}^n c_i^2 = \sum_{i=1}^n \left( c_i - \frac{1}{n} + \frac{1}{n} \right)^2$.
\end{enumerate}

% \newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\item \label{GM} For the general linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, suppose we want to estimate the linear combination $\boldsymbol{\ell}^\prime\boldsymbol{\beta}$ based on sample data. The Gauss-Markov Theorem tells us that the most natural choice is also (in a sense) the best choice.
This question leads you through an alternative proof of the Gauss-Markov Theorem, modelled on Question~\ref{nox}.
\begin{enumerate}
\item What is the most natural choice for estimating the linear combination $\boldsymbol{\ell}^\prime\boldsymbol{\beta}$?
\item Show that this estimator is unbiased.
\item The natural estimator is a \emph{linear} unbiased estimator of the form $\mathbf{c}_0^\prime \mathbf{y}$. What is the $n \times 1$ vector $\mathbf{c}_0$?
\item Of course there are lots of other possible linear unbiased estimators of $\boldsymbol{\ell}^\prime\boldsymbol{\beta}$. They are all of the form $\mathbf{c}^\prime \mathbf{y}$; the natural estimator $\mathbf{c}_0^\prime \mathbf{y}$ is just one of these. The best one is the one with the smallest variance, because its distribution is the most concentrated around the right answer. What is $Var(\mathbf{c}^\prime \mathbf{y})$? Show your work.
\item We insist that $\mathbf{c}^\prime \mathbf{y}$ be unbiased. Show that if $E(\mathbf{c}^\prime \mathbf{y}) = \boldsymbol{\ell}^\prime\boldsymbol{\beta}$ for \emph{all} $\boldsymbol{\beta} \in \mathbb{R}^{k+1}$, we must have $\mathbf{X}^\prime\mathbf{c} = \boldsymbol{\ell}$.
\item So, the task is to minimize $Var(\mathbf{c}^\prime \mathbf{y})$ by minimizing $\mathbf{c}^\prime\mathbf{c}$ over all $\mathbf{c}$ subject to the constraint $\mathbf{X}^\prime\mathbf{c} = \boldsymbol{\ell}$. As preparation for this, show $(\mathbf{c}-\mathbf{c}_0)^\prime\mathbf{c}_0 = 0$.
\item Using the result of the preceding question, show
\begin{displaymath}
\mathbf{c}^\prime\mathbf{c} = (\mathbf{c}-\mathbf{c}_0)^\prime(\mathbf{c}-\mathbf{c}_0) + \mathbf{c}_0^\prime\mathbf{c}_0.
\end{displaymath}
\item Since the formula for $\mathbf{c}_0$ has no $\mathbf{c}$ in it, what choice of $\mathbf{c}$ minimizes the preceding expression? How do you know that the minimum is unique?
\end{enumerate}
The conclusion is that $\mathbf{c}_0^\prime \mathbf{y} = \boldsymbol{\ell}^\prime \widehat{\boldsymbol{\beta}}$ is the Best Linear Unbiased Estimator (BLUE) of $\boldsymbol{\ell}^\prime\boldsymbol{\beta}$.

\item The model for simple regression through the origin is $y_i = \beta x_i + \epsilon_i$, where the $x_i$ are known constants and $\epsilon_1, \ldots, \epsilon_n$ are independent with expected value $0$ and variance $\sigma^2$. In previous homework, you found the least squares estimate of $\beta$ to be $\widehat{\beta} = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i=1}^n x_i^2}$.
\begin{enumerate}
\item What is $Var(\widehat{\beta})$?
\item Let $\widehat{\beta}_2 = \frac{\overline{y}_n}{\overline{x}_n}$.
\begin{enumerate}
\item Is $\widehat{\beta}_2$ an unbiased estimator of $\beta$? Answer Yes or No and show your work.
\item Is $\widehat{\beta}_2$ a linear combination of the $y_i$ variables, of the form $L = \sum_{i=1}^n c_i y_i$? If so, what is $c_i$?
\item What is $Var(\widehat{\beta}_2)$?
\item How do you know $Var(\widehat{\beta}) \leq Var(\widehat{\beta}_2)$? No calculations are necessary.
\item Under what circumstances are the two variances equal?
\end{enumerate}

% \newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\item Let $\widehat{\beta}_3 = \frac{1}{n}\sum_{i=1}^n \frac{y_i}{x_i}$.
\begin{enumerate}
\item Is $\widehat{\beta}_3$ an unbiased estimator of $\beta$? Answer Yes or No and show your work.
\item Is $\widehat{\beta}_3$ a linear combination of the $y_i$ variables, of the form $L = \sum_{i=1}^n c_i y_i$? If so, what is $c_i$?
\item What is $Var(\widehat{\beta}_3)$?
\item How do you know $Var(\widehat{\beta}) \leq Var(\widehat{\beta}_3)$? No calculations are necessary.
\item Under what circumstances are the two variances equal?
\end{enumerate}
\end{enumerate}

\item The set of vectors $\EuScript{V} = \{\mathbf{v} = \mathbf{X}\mathbf{a}: \mathbf{a} \in \mathbb{R}^{k+1}\}$ is the subset of $\mathbb{R}^{n}$ consisting of linear combinations of the columns of $\mathbf{X}$. That is, $\EuScript{V}$ is the space \emph{spanned} by the columns of $\mathbf{X}$. The least squares estimator $\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\prime \mathbf{X})^{-1}\mathbf{X}^\prime\mathbf{y}$ was obtained by minimizing $(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})^\prime(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})$ over all $\boldsymbol{\beta} \in \mathbb{R}^{k+1}$. Thus, $\widehat{\mathbf{y}} = \mathbf{X}\widehat{\boldsymbol{\beta}}$ is the point in $\EuScript{V}$ that is \emph{closest} to the data vector $\mathbf{y}$. Geometrically, $\widehat{\mathbf{y}}$ is the \emph{projection} (shadow) of $\mathbf{y}$ onto $\EuScript{V}$. The hat matrix $\mathbf{H}$ is a \emph{projection matrix}. It projects any point in $\mathbb{R}^{n}$ onto $\EuScript{V}$. Now we will test out several consequences of this idea.
\begin{enumerate}
\item The shadow of a point already in $\EuScript{V}$ should be right at the point itself. Show that if $\mathbf{v} \in \EuScript{V}$, then $\mathbf{Hv}= \mathbf{v}$.
\item The vector of differences $\widehat{\boldsymbol{\epsilon}} = \mathbf{y} - \widehat{\mathbf{y}}$ should be perpendicular (at right angles) to each and every basis vector of $\EuScript{V}$. How is this related to the formula $\mathbf{X}^\prime \, \widehat{\boldsymbol{\epsilon}} = \mathbf{0}$?
\item Show that the vector of residuals $\widehat{\boldsymbol{\epsilon}}$ is perpendicular to any $\mathbf{v} \in \EuScript{V}$.
\item The picture on Slide 27 of Lecture Unit Eight (More Least Squares) suggests that the closest point to $\widehat{\boldsymbol{\epsilon}}$ in $\EuScript{V}$ should be $\mathbf{0}$. Is this true? Answer Yes or No and show your work.
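A hint that may help with several parts of this question: the hat matrix is symmetric and idempotent, and so is $\mathbf{I}-\mathbf{H}$. That is,
\begin{displaymath}
\mathbf{H}^\prime = \mathbf{H} \qquad \mbox{and} \qquad \mathbf{H}\mathbf{H} = \mathbf{H},
\end{displaymath}
and similarly for $\mathbf{I}-\mathbf{H}$.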
\item In the proof of the Gauss-Markov Theorem (see Question~\ref{GM}), $\mathbf{c}$ is a point in $\mathbb{R}^n$. Show that if $E(\mathbf{c}^\prime\mathbf{y}) = \boldsymbol{\ell}^\prime\boldsymbol{\beta}$ for all $\boldsymbol{\beta} \in \mathbb{R}^{k+1}$, then the closest point to $\mathbf{c}$ in $\EuScript{V}$ is $\mathbf{c}_0$. \end{enumerate} \item For the general linear regression model, suppose the error terms $\epsilon_i$ are independent and normally distributed. \begin{enumerate} \item In this case, what is the distribution of $y_i$? Just write down the answer without proof. \item Show that the maximum likelihood estimates of $\beta_0, \ldots, \beta_k$ are identical to the least squares estimates. This is an important result. \item Find the maximum likelihood estimate of $\sigma^2$. How does it compare to the general version of $s^2$ on the formula sheet? Is the MLE unbiased? \end{enumerate} % \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \item \label{sat} In the United States, admission to university is based partly on high school marks and recommendations, and partly on applicants' performance on a standardized multiple choice test called the Scholastic Aptitude Test (SAT). The SAT has two sub-tests, Verbal and Math. A university administrator selected a random sample of 200 applicants, and recorded the Verbal SAT, the Math SAT and first-year university Grade Point Average (GPA) for each student. The data are available at \begin{center} \href{http://www.utstat.toronto.edu/~brunner/data/legal/openSAT.data.txt} {\texttt{http://www.utstat.toronto.edu/$\sim$brunner/data/legal/openSAT.data.txt}}. \end{center} We seek to predict GPA from the two test scores. Please use R with \emph{matrix operations} to do the following. I found the \texttt{as.matrix} function to be useful, since I did not use \texttt{attach}. \begin{enumerate} \item Calculate $\widehat{\boldsymbol{\beta}}$. Display the answer, a set of three numbers. 
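This is not the assignment solution, just a toy sketch with made-up numbers showing how the formula $\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\prime \mathbf{X})^{-1}\mathbf{X}^\prime\mathbf{y}$ translates into R: \verb|t()| transposes, \verb|%*%| is matrix multiplication, and \verb|solve()| inverts.
\begin{verbatim}
# Toy illustration only: these numbers are made up, not the SAT data.
X <- cbind(1,                  # column of ones for the intercept
           c(2, 4, 6, 8, 10),  # first explanatory variable
           c(1, 3, 2, 5, 4))   # second explanatory variable
y <- c(3, 7, 8, 14, 13)
# betahat = (X'X)^{-1} X'y
betahat <- solve(t(X) %*% X) %*% t(X) %*% y
yhat <- X %*% betahat          # vector of predicted values
ehat <- y - yhat               # vector of residuals
t(X) %*% ehat                  # orthogonality check: essentially zero
\end{verbatim}
For the real data, \texttt{as.matrix} can turn the relevant columns of the data frame into the $\mathbf{X}$ matrix and $\mathbf{y}$ vector.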
\item Predict first-year GPA for a student with a Verbal SAT of 600 and a Math SAT of 700. The answer is a number. Display it on your output.
\item Calculate $\widehat{\mathbf{y}}$. You don't have to display all the numbers because there are 200 of them, but calculate and display the sample mean of the $\widehat{y}_i$. Compare it to $\overline{y}$, which you should also display.
\item Calculate $\widehat{\boldsymbol{\epsilon}}$. Don't display all the residuals, but compute their sample mean and display that. It is not \emph{exactly} what you expected. Why?
\item Calculate and display the inner product of $\widehat{\mathbf{y}}$ and $\widehat{\boldsymbol{\epsilon}}$.
\item Calculate and display the inner product of $\widehat{\boldsymbol{\epsilon}}$ with total SAT score, which is the sum of Verbal SAT and Math SAT. How did you know what to expect?
\end{enumerate}
You can check your answers using the \texttt{lm} function. It would be good to prepare a PDF with your complete R input and output, in case you need to hand it in with the quiz.

\item \label{faraway} In Faraway's \emph{Linear Models with R}, read Chapter One. The main lesson is that there is more to data analysis than the technical material we will cover in STA302. Then read Chapter Two, skipping Section 2.9 on Identifiability. The author seemingly was in a bit of a hurry when he wrote that part. The coverage of projections and the Gauss-Markov Theorem is unexpected in an R book, but by now you should be able to follow it. There are some handy methods for extracting information from R objects, for example on page 22. Then, using the \texttt{lm} function, do Exercise 1 on page 25 and display the answers. The description of the data set is sketchy.
The variables are
\begin{itemize}
\item \texttt{sex}: 0=male, 1=female
\item \texttt{status}: Socioeconomic status score based on parents' occupation
\item \texttt{income}: in pounds per week
\item \texttt{verbal}: verbal score in words out of 12 correctly defined
\item \texttt{gamble}: expenditure on gambling in pounds per year
\end{itemize}
For part (f), consider the \emph{difference} between a female and a male with identical values on all the other $x$ variables. There currently seems to be a problem with the \texttt{faraway} R package. You can obtain the teen gambling data from
\begin{center}
\href{http://www.utstat.toronto.edu/~brunner/data/legal/teengamb.data.txt} {\texttt{http://www.utstat.toronto.edu/$\sim$brunner/data/legal/teengamb.data.txt}}.
\end{center}
As in Question~\ref{sat}, please prepare a PDF with your complete R input and output, in case you need to hand it in with the quiz.

\end{enumerate} % End of all the questions

%\vspace{60mm}

\end{document}