\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\newcounter{Problem} % This is a good trick I just learned. I want to have some numbered questions, then a big section OUTSIDE the enumerate environment, describing a data set, and then continue the numbering in the next enumerate where I left off. I've just created a new counter called Problem. After the end of the first enumerate, I will save the value of the enumi counter in Problem with \setcounter{Problem}{\theenumi}, and then between \begin{enumerate} and the first \item in the second enumerate, initialize the enumi counter to the value stored in Problem (instead of zero) with \setcounter{enumi}{\theProblem}.

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f14 Assignment Six}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
See the formula sheet for the general linear model with normal error terms. You may use anything from the formula sheet unless you are explicitly asked to prove it, or are instructed otherwise.

\begin{enumerate}

\item In the general linear model with normal error terms, what is the distribution of $\mathbf{Y}$?

\item You know that the least squares estimate of $\boldsymbol{\beta}$ is $\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\prime \mathbf{X})^{-1} \mathbf{X}^\prime \mathbf{Y}$. What is the distribution of $\widehat{\boldsymbol{\beta}}$? Show the calculations.

\item Let $\widehat{\mathbf{Y}}=\mathbf{X}\widehat{\boldsymbol{\beta}}$. What is the distribution of $\widehat{\mathbf{Y}}$? Show the calculations.
\item Let the vector of residuals $\hat{\boldsymbol{\epsilon}} = \mathbf{Y}-\widehat{\mathbf{Y}}$. What is the distribution of $\hat{\boldsymbol{\epsilon}}$? Show the calculations. Simplify both the expected value (which is zero) and the covariance matrix. \item Recall from an earlier homework problem that if $\mathbf{T}$ is a random vector with expected value $\boldsymbol{\mu}$, then $cov(\mathbf{T}) = E(\mathbf{TT}^\prime) - \boldsymbol{\mu\mu}^\prime$. Using this fact, give expressions for \begin{enumerate} \item $E(\mathbf{YY}^\prime)$ \item $E(\widehat{\boldsymbol{\beta}}\widehat{\boldsymbol{\beta}}^\prime)$ \end{enumerate} These may be helpful in the next question. \item For the general linear regression model, show that the $n \times (k+1)$ matrix of covariances $C(\hat{\boldsymbol{\epsilon}},\widehat{\boldsymbol{\beta}}) = \mathbf{0} $. Why does this show that $SSE = \hat{\boldsymbol{\epsilon}}^\prime\hat{\boldsymbol{\epsilon}}$ and $\widehat{\boldsymbol{\beta}}$ are independent? \item In an earlier Assignment, you proved that \begin{displaymath} (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^\prime (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}) = (\mathbf{Y}-\mathbf{X}\widehat{\boldsymbol{\beta}})^\prime (\mathbf{Y}-\mathbf{X}\widehat{\boldsymbol{\beta}}) + (\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta})^\prime (\mathbf{X^\prime X}) (\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}). \end{displaymath} Starting with this expression, show that $SSE/\sigma^2 \sim \chi^2(n-k-1)$. Use the formula sheet. \item The $t$ distribution is defined as follows. Let $Z\sim N(0,1)$ and $W \sim \chi^2(\nu)$, with $Z$ and $W$ independent. Then $T = \frac{Z}{\sqrt{W/\nu}}$ is said to have a $t$ distribution with $\nu$ degrees of freedom, and we write $T \sim t(\nu)$. For the general fixed effects linear regression model, tests and confidence intervals for linear combinations of regression coefficients are very useful. 
Derive the appropriate $t$ distribution and some applications by following these steps. Let $\mathbf{a}$ be a $(k+1) \times 1$ vector of constants.
    \begin{enumerate}
    \item What is the distribution of $\mathbf{a}^\prime \widehat{\boldsymbol{\beta}}$? Show a little work. Your answer includes both the expected value and the variance.
    \item Now standardize the difference (subtract off the mean and divide by the standard deviation) to obtain a standard normal.
    \item Divide by the square root of a well-chosen chi-squared random variable, divided by its degrees of freedom, and simplify. Call the result $T$.
    \item How do you know the numerator and denominator are independent?
    \item Suppose you wanted to test $H_0: \mathbf{a}^\prime\boldsymbol{\beta} = c$. Write down a formula for the test statistic.
    \item For a regression model with four independent variables, suppose you wanted to test $H_0: \beta_2=0$. Give the vector $\mathbf{a}$.
    \item For a regression model with four independent variables, suppose you wanted to test $H_0: \beta_1=\beta_2$. Give the vector $\mathbf{a}$.
    \item Letting $t_{\alpha/2}$ denote the point cutting off the top $\alpha/2$ of the $t$ distribution with $n-k-1$ degrees of freedom, derive the $(1-\alpha) \times 100\%$ confidence interval for $\mathbf{a}^\prime\boldsymbol{\beta}$. ``Derive'' means show the high school algebra.
    \end{enumerate}

\item Letting $SST = \sum_{i=1}^n(Y_i-\overline{Y})^2$, $SSE = \sum_{i=1}^n(Y_i-\widehat{Y}_i)^2$ and $SSR = \sum_{i=1}^n(\widehat{Y}_i-\overline{Y})^2$, show $SST=SSR+SSE$.

\item Show that $\overline{Y}$ is a function of $\widehat{\boldsymbol{\beta}}$. Why does this establish that $SSR$ and $SSE$ are independent?

\item If $H_0: \beta_1 = \cdots = \beta_k = 0$ is true,
    \begin{enumerate}
    \item What is the distribution of $Y_i$?
    \item What is the distribution of $\frac{SST}{\sigma^2}$? Just write down the answer. You already did it in Assignment 2, and again in Assignment 5.
    \end{enumerate}

\item Still assuming $H_0: \beta_1 = \cdots = \beta_k = 0$ is true, what is the distribution of $SSR/\sigma^2$? Use the formula sheet and show your work.

\item Suppose $H_0: \beta_1 = \cdots = \beta_k = 0$ were \emph{false}. Would you expect $SSR$ to be bigger, or would you expect it to be smaller? Which one, and why?

\item Recall the definition of the $F$ distribution. If $W_1 \sim \chi^2(\nu_1)$ and $W_2 \sim \chi^2(\nu_2)$ are independent, $F = \frac{W_1/\nu_1}{W_2/\nu_2} \sim F(\nu_1,\nu_2)$. How do you know $F = \frac{SSR/k}{SSE/(n-k-1)}$ has an $F$ distribution under $H_0: \beta_1 = \cdots = \beta_k = 0$? List the numbers of the questions that establish the necessary facts.

\end{enumerate}
\setcounter{Problem}{\theenumi}

\vspace{20mm}
% \arabic{Problem}
\newpage

\noindent
The rest of this assignment uses the data file
\href{http://www.utstat.toronto.edu/~brunner/302f14/code_n_data/hw/CensusTract.data}
{\texttt{CensusTract.data}}, given in \emph{Applied Linear Statistical Models} (1996), by Neter et al. The data are used here without permission. There is a link on the course home page in case the one in this document does not work. The cases (there are $n$ cases) are a sample of census tracts in the United States. For each census tract, the following variables are recorded.
\vspace{5mm}

\begin{tabular}{ll}
\texttt{area}   & Land area in square miles \\
\texttt{pop}    & Population in thousands \\
\texttt{urban}  & Percent of population in cities \\
\texttt{old}    & Percent of population 65 or older \\
\texttt{docs}   & Number of active physicians \\
\texttt{beds}   & Number of hospital beds \\
\texttt{hs}     & Percent of population 25 or older completing 12+ years of school \\
\texttt{labor}  & Number of persons 16+ employed or looking for work \\
\texttt{income} & Total before-tax income in millions of dollars \\
\texttt{crimes} & Total number of serious crimes reported by police \\
\texttt{region} & Region of the country: 1=NE, 2=NC, 3=S, 4=W \\
\end{tabular}

% \vspace{5mm}
% \noindent
% In the analyses you do, the dependent variable is \texttt{crimes}, and \texttt{region} is excluded for now. That means there are $k=9$ potential independent variables.

\begin{enumerate}
\setcounter{enumi}{\theProblem} % Start with where we left off in the last enumerate.

\item % \arabic{enumi}% \setcounter{enumi}{-1}
First, fit\footnote{To ``fit'' a model means to estimate the parameters.} a regression model with \texttt{crimes} as the dependent variable and just one independent variable: \texttt{pop}.
    \begin{enumerate}
    \item In plain, non-statistical language, what do you conclude from this analysis? The answer is something about population size and number of crimes.
    \item What proportion of the variation in number of crimes is explained by population size? The answer is a number between zero and 1.
    \end{enumerate}
\textbf{Bring your printout to the quiz.}

\item Based on that last analysis, we will create a new dependent variable called crime \emph{rate}, defined as number of crimes divided by population size. The \texttt{attach} function should help; type \texttt{help(attach)} at the R prompt.
Now fit a new regression model in which crime rate is a function of \texttt{area}, \texttt{urban}, \texttt{old}, \texttt{docs}, \texttt{beds}, \texttt{hs}, \texttt{labor} and \texttt{income}.
% Don't hesitate to type \texttt{help(lm)} to find out more, and do not hesitate to do additional
Based on this model,
    \begin{enumerate}
    \item What is $k$? The answer is a number. % 8
    \item What is $\widehat{\beta}_4$? The answer is a number. % 0.0042640
    \item Give the test statistic, the degrees of freedom and the $p$-value for each of the following null hypotheses. The answers are numbers from your printout.
        \begin{enumerate}
        \item $H_0: \beta_1=\beta_2= \cdots = \beta_8 = 0$
        \item $H_0: \beta_6=0$
        \item $H_0: \beta_0=0$
        \end{enumerate}
    \item What proportion of the variation in crime rate is explained by the independent variables in this model? The answer is a number. % 0.3214
    \item What is the smallest value of $\widehat{\epsilon}_i$? The answer is a number. %
    \item What is the largest value of $\widehat{\epsilon}_i$? The answer is a number. %
    \item Look at the output of \texttt{summary}. For the first entry under ``\texttt{t value}'' (that's \texttt{2.057}), what is the null hypothesis? The answer is a symbolic statement involving one or more Greek letters. % H0: beta0=0
    \item Look at the $F$ test at the end of the \texttt{summary} output. What is the null hypothesis? The answer is a symbolic statement involving one or more Greek letters. % H0: beta1 = ... = beta8 = 0
    \item Controlling for all the other variables in the model, is number of hospital beds related to crime rate?
        \begin{enumerate}
        \item Give the null hypothesis in symbols.
        \item Give the value of the test statistic. The answer is a number from your printout.
        \item Give the $p$-value. The answer is a number from your printout.
        \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
        \item Allowing for other variables, census tracts with more hospital beds tend to have \underline{~~~~~~~} crime rates. % lower
        \end{enumerate}
    \item Controlling for all the other variables in the model, is number of physicians related to crime rate?
        \begin{enumerate}
        \item Give the null hypothesis in symbols.
        \item Give the value of the test statistic. The answer is a number from your printout.
        \item Give the $p$-value. The answer is a number from your printout.
        \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
        \item Allowing for other variables, census tracts with more physicians tend to have \underline{~~~~~~~} crime rates. % higher
        \end{enumerate}
    \end{enumerate}
\textbf{Bring your printout to the quiz.}

\end{enumerate}

\vspace{20mm}

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline \end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f14}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f14}}

\end{document}

% Next assignment

\item The general linear model with normal error terms is $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, the columns of $\mathbf{X}$ are linearly independent, and $\boldsymbol{\epsilon} \sim N_n(\mathbf{0},\sigma^2\mathbf{I}_n)$.
You know that \begin{itemize} \item $\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\prime \mathbf{X})^{-1} \mathbf{X}^\prime \mathbf{Y} \sim N_{k+1}\left(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}^\prime \mathbf{X})^{-1}\right)$ \item $SSE/\sigma^2 = \hat{\boldsymbol{\epsilon}}^\prime \hat{\boldsymbol{\epsilon}}/\sigma^2 \sim \chi^2(n-k-1) $, independent of $\widehat{\boldsymbol{\beta}}$. \end{itemize} Derive the $(1-\alpha)\times 100\%$ prediction interval for a new observation from this population. For the census tract data, \item Predict the crime rate for a new census tract with an area of 2,500 square miles, 50 percent urban, 10 percent senior citizens, 2,000 doctors, 6,000 hospital beds, 50 percent finished high school, a labour force of 450 thousand, and a total income of 6,500 million dollars. Give both a predicted value (a single number) and a 95\% prediction interval.