\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442 Assignment Seven}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent Please bring your R printouts to the quiz. \emph{Your printouts must \emph{not} contain answers to the non-computer parts of this assignment}. The non-computer questions are just practice for the quiz, and will not be handed in. %, though you may use R as a calculator.
% Bring a real calculator to the quiz.
% Check for prime and k epsilon-hat C q response

\begin{enumerate}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%% More Regression %%%%%%%%%%%%%%%%%%%%%%%%%
\item Show that for a simple regression with an intercept and one explanatory variable, $R^2=r^2$, where
$r = \frac{\sum_{i=1}^n (x_i-\overline{x})(Y_i-\overline{Y})} {\sqrt{\sum_{i=1}^n (x_i-\overline{x})^2} \sqrt{\sum_{i=1}^n (Y_i-\overline{Y})^2}}$
is the ordinary sample correlation coefficient. You may also use the formulas
$\widehat{\beta}_1 = \frac{\sum_{i=1}^n (x_i-\overline{x})(Y_i-\overline{Y})} {\sum_{i=1}^n (x_i-\overline{x})^2}$
and $\widehat{\beta}_0 = \overline{Y} - \widehat{\beta}_1\overline{x}$. It helps to start with the formula for $R^2$, and then substitute for $\widehat{Y}_i$ right away.

\item Consider a linear regression model with $n>p$, which is always the case in practice. Since the vector of residuals $\mathbf{e} \sim N\left(\mathbf{0},\sigma^2(\mathbf{I}-\mathbf{H})\right)$, it is tempting to write \break
$\frac{1}{\sigma^2}\mathbf{e}^\top (\mathbf{I}-\mathbf{H})^{-1} \mathbf{e} \sim \chi^2(n)$.
Please locate support for this idea on the formula sheet. But it only works if the $n \times n$ matrix $\mathbf{I}-\mathbf{H}$ has an inverse.
\begin{enumerate}
\item The rank of a product is at most the minimum of the ranks. Why does this tell you that the hat matrix $\mathbf{H}$ has no inverse? If you don't remember what the rank of a matrix is, look it up.
\item But to me, it's not so obvious for $\mathbf{I}-\mathbf{H}$. Calculate $(\mathbf{I}-\mathbf{H}) \, \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}$. At this point the answer will be clear to some people, but not everybody. Continue by assuming that $(\mathbf{I}-\mathbf{H})^{-1}$ exists. If it does, you arrive at a conclusion that is impossible. Complete the proof.
\end{enumerate}

\item For the general linear model (see formula sheet),
\begin{enumerate}
\item What is the distribution of $\mathbf{L}\widehat{\boldsymbol{\beta}}$? Note $\mathbf{L}$ is $r \times p$.
\item If $H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{h}$ is true, what is the distribution of $\frac{1}{\sigma^2}(\mathbf{L}\widehat{\boldsymbol{\beta}}-\mathbf{h})^\top (\mathbf{L}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{L}^\top)^{-1} (\mathbf{L}\widehat{\boldsymbol{\beta}}-\mathbf{h})$? Please locate support for your answer on the formula sheet. For full marks, don't forget the degrees of freedom.
\item What other facts on the formula sheet allow you to establish the $F$ distribution for the general linear test? The distribution is \emph{given} on the formula sheet, so of course you can't use that. How do you know numerator and denominator are independent?
\end{enumerate}

\item Suppose you wish to test the null hypothesis that a \emph{single} linear combination of regression coefficients is equal to zero. That is, you want to test $H_0: \mathbf{a}^\top\boldsymbol{\beta} = 0$. Referring to the formula sheet, verify that $F=T^2$. Show your work.
\newpage \item Suppose you fit (estimate the parameters of) a regression model, obtaining $\widehat{\boldsymbol{\beta}}$, $\widehat{\mathbf{Y}}$ and $\mathbf{e}$. Call this Model One. \begin{enumerate} \item Then just for fun, you fit a second regression model, using $\widehat{\mathbf{Y}}$ from Model One as the response variable, and exactly the same $\mathbf{X}$ matrix as Model One. Call this Model Two. \begin{enumerate} \item What is $\widehat{\boldsymbol{\beta}}$ for Model Two? Show your work and simplify. \item What is $\widehat{\mathbf{Y}}$ for Model Two? Show your work and simplify. \item What is $\mathbf{e}$ for Model Two? Show your work and simplify. \item What is $MSE$ for Model Two? \end{enumerate} \item Now you fit a \emph{third} regression model, this time using $\mathbf{e}$ from Model One as the response variable, and again, exactly the same $\mathbf{X}$ matrix as Model One. Call this Model Three. \begin{enumerate} \item What is $\widehat{\boldsymbol{\beta}}$ for Model Three? Show your work and simplify. \item What is $\widehat{\mathbf{Y}}$ for Model Three? Show your work and simplify. \item What is $\mathbf{e}$ for Model Three? Show your work and simplify. \end{enumerate} \end{enumerate} \item For the usual multiple linear regression model with normal error terms, you already know that $\mathbf{e} \sim N(\mathbf{0},\sigma^2(\mathbf{I}-\mathbf{H}))$, where $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top $. Let $\mathbf{Z} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \, \mathbf{e}$. \begin{enumerate} \item Find the distribution of $\mathbf{Z}$. The answer is remarkable, so keep simplifying! Cite facts from the formula sheet when you use them. \item To see what happened, simplify the expression for $\mathbf{Z}$ itself as much as possible. Once you have the answer, look at the footnote\footnote{This is related to the concept of ``degrees of freedom." 
Because $p$ linear combinations of the residuals equal exactly zero, you can see that $p$ of them are linear combinations of the others. Thus, only $n-p$ are free to vary. Mr. Fisher was a very smart guy.}.
\item What is $M_\mathbf{Z}(\mathbf{t})$?
\end{enumerate}

\item Suppose data for a regression study are collected at two different locations; $n_1$ observations are collected at location one, and $n_2$ observations are collected at location two. The same explanatory variables are used at each location. We need to know whether the error variance $\sigma^2$ is the same at the two locations. Recall the definition of the $F$ distribution. If $W_1 \sim \chi^2(\nu_1)$ and $W_2 \sim \chi^2(\nu_2)$ are independent, then $F = \frac{W_1/\nu_1}{W_2/\nu_2} \sim F(\nu_1,\nu_2)$. Suggest a statistic for testing $H_0: \sigma^2_1=\sigma^2_2$. Using facts from the formula sheet, show it has an $F$ distribution when $H_0$ is true. Don't forget to state the degrees of freedom. Assume that data coming from the two locations are independent.

% \newpage
\item This question will be a lot easier if you remember that if $X \sim \chi^2(\nu)$, then $E(X)=\nu$ and $Var(X)=2\nu$. You don't have to prove this; just use it. You can also use things you already know about ordinary linear regression with normal errors. For the usual linear regression model with normal errors, $\sigma^2$ is usually estimated with $MSE$.
\begin{enumerate}
\item Show that $MSE$ is an unbiased estimator of $\sigma^2$.
\item Show that $MSE$ is a consistent estimator of $\sigma^2$.
\item Under the usual regression model what is the joint distribution of $\epsilon_1, \ldots, \epsilon_n$?
\item Let $T_n = \frac{1}{n} \sum_{i=1}^n \epsilon_i^2$. What is $E(T_n)$?
\item How do you know that $T_n \stackrel{p}{\rightarrow} \sigma^2$?
\item Show that $Var(T_n) < Var(MSE)$.
\item So it would appear that $T_n$ is a better estimator of $\sigma^2$ than $MSE$ is, since they are both unbiased and the variance of $T_n$ is lower.
So why do you think $MSE$ is used in regression analysis instead of $T_n$?
\end{enumerate}

\item Ordinary linear regression is often applied to data sets where the independent variables are best modeled as random variables. In what way does the usual conditional linear regression model with normal errors imply that (random) explanatory variables have zero covariance with the error term? Hint: Assume $\mathbf{X}_i$ as well as $\epsilon_i$ continuous. What is the conditional distribution of $\epsilon_i$ given $\mathbf{X}_i$?

\item For a model with just one \emph{random} explanatory variable, show that $E(\epsilon_i|X_i=x_i)=0$ for all $x_i$ implies $Cov(X_i,\epsilon_i)=0$, so that a standard regression model without the normality assumption still implies zero covariance (though not necessarily independence) between the error term and explanatory variables.

% \newpage
\item \label{workout} In a study comparing the effectiveness of different exercise programmes, volunteers were randomly assigned to one of three exercise programmes ($A$, $B$, $C$) or put on a waiting list and told to work out on their own. Aerobic capacity is the body's ability to process oxygen. Aerobic capacity was measured before and after 6 months of participation in the programme (or 6 months of being on the waiting list). The response variable was improvement in aerobic capacity. The explanatory variables were age (a covariate) and treatment group.
\begin{enumerate}
\item First consider a regression model with an intercept, and no interaction between age and treatment group.
\begin{enumerate}
\item Make a table showing how you would set up indicator dummy variables for treatment group. Make Waiting List the reference category.
\item Write the regression model. Please use $x$ for age, and make its regression coefficient $\beta_1$.
\item In terms of $\beta$ values, what null hypothesis would you test to find out whether, allowing for age, the three exercise programmes differ in their effectiveness?
\item Write the null hypothesis for the preceding question as $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$. Just give the $\mathbf{L}$ matrix.
\item In terms of $\beta$ values, what null hypothesis would you test to find out whether Programme $B$ was better than the waiting list?
\item In terms of $\beta$ values, what null hypothesis would you test to find out whether Programmes $A$ and $B$ differ in their effectiveness?
\item Suppose you wanted to estimate the difference in average benefit between programmes $A$ and $C$ for a 27-year-old participant. Give your answer in terms of $\widehat{\beta}$ values.
\item Is it safe to assume that age is independent of the other explanatory variables? Answer Yes or No and briefly explain.
\end{enumerate}

% \newpage
\item \label{interac} Now consider a regression model with an intercept and the interaction (actually a set of interactions) between age and treatment.
\begin{enumerate}
\item Write the regression model. Make it an extension of your earlier model.
\item Suppose you wanted to know whether the slopes of the 4 regression lines were equal. In terms of $\beta$ values, what null hypothesis would you test?
\item Suppose you wanted to know whether any differences among mean improvement in the four treatment conditions depend on the participant's age. In terms of $\beta$ values, what null hypothesis would you test?
\item Write the null hypothesis for the preceding question as $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$. Just give the $\mathbf{L}$ matrix. It is $r \times p$. What is $r$? What is $p$?
\item Suppose you wanted to know whether the difference in effectiveness between Programme $A$ and the Waiting List depends on the participant's age. In terms of $\beta$ values, what null hypothesis would you test?
\item Suppose you wanted to \emph{estimate} the difference in average benefit between programmes $A$ and $C$ for a 27-year-old participant. Give your answer in terms of $\widehat{\beta}$ values.
\end{enumerate}

% \newpage
\item Now consider a regression model \emph{without} an intercept, but \emph{with} possibly unequal slopes. Make a table to show how the dummy variables could be set up, and write the regression model. Again, please use $x$ for age and make its regression coefficient $\beta_1$. This model needs to have the \emph{same number of regression coefficients as the model of Question~\ref{interac}}, so you have to think about this a little. For each treatment condition, what is the conditional expected value of $Y$? The answer is in terms of $x$ and the $\beta$ values. Please put these values as the last column of your table.
\begin{enumerate}
\item Suppose you wanted to know whether the slopes of the 4 regression lines were equal. In terms of $\beta$ values, what null hypothesis would you test?
\item Suppose you wanted to know whether any differences among mean improvement in the four treatment conditions depend on the participant's age. In terms of $\beta$ values, what null hypothesis would you test?
\item Write the null hypothesis for the preceding question as $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$. Just give the $\mathbf{L}$ matrix. It is $r \times p$. What is $r$? What is $p$?
\item Suppose you wanted to know whether the difference in effectiveness between Programme $A$ and the Waiting List depends on the participant's age. In terms of $\beta$ values, what null hypothesis would you test?
\item Suppose you wanted to estimate the difference in average benefit between programmes $A$ and $C$ for a 27-year-old participant. Give your answer in terms of $\widehat{\beta}$ values.
\end{enumerate}
\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Data %%%%%%%%%%%%%%%%%%%%%%%%%
\item The data file \href{http://www.utstat.toronto.edu/~brunner/appliedf14/code_n_data/hw/CensusTract.data} {\texttt{CensusTract.data}} comes from \emph{Applied Linear Statistical Models} (1996), by Neter et al. The data are used here without permission.
There is a link on the course home page in case the one in this document does not work. The cases (there are $n$ cases) are a sample of census tracts in the United States. For each census tract, the following variables are recorded.
% \vspace{5mm}

\begin{tabular}{ll}
\texttt{area} & Land area in square miles \\
\texttt{pop} & Population in thousands \\
\texttt{urban} & Percent of population in cities \\
\texttt{old} & Percent of population 65 or older \\
\texttt{docs} & Number of active physicians \\
\texttt{beds} & Number of hospital beds \\
\texttt{hs} & Percent of population 25 or older completing 12+ years of school \\
\texttt{labor} & Number of persons 16+ employed or looking for work \\
\texttt{income} & Total before-tax income in millions of dollars \\
\texttt{crimes} & Total number of serious crimes reported by police \\
\texttt{region} & Region of the country: 1=Northeast, 2=North Central, 3=South, 4=West \\
\end{tabular}

\begin{enumerate}
\item First, use R to fit a regression model with \texttt{crimes} as the response variable and just one explanatory variable: \texttt{pop}.
\begin{enumerate}
\item In plain, non-statistical language, what do you conclude from this analysis? The answer is something about population size and number of crimes.
\item What proportion of the variation in number of crimes is explained by population size? The answer is a number between zero and 1.
\end{enumerate}
\textbf{Bring your printout to the quiz.}

\item Based on that last analysis, we will create a new response variable called crime \emph{rate}, defined as number of crimes divided by population size. Now fit a new regression model in which crime rate is a function of \texttt{area}, \texttt{urban}, \texttt{old}, \texttt{docs}, \texttt{beds}, \texttt{hs}, \texttt{labor}, \texttt{income} and \texttt{region} of the country. There are no interactions for now. This is the \emph{full model} in all the analyses that follow.
Just so we will be doing things the same way, please make \texttt{region} a factor, and look at help to see how to use the \texttt{labels=} option. It really helps. Based on this model,
\begin{enumerate}
\item What is $p$? The answer is a number. % 11
\item What is $\widehat{\beta}_4$? The answer is a number. %
\item Give the test statistic, the degrees of freedom and the $p$-value for each of the following null hypotheses. The answers are numbers from your printout.
\begin{enumerate}
\item $H_0: \beta_1=\beta_2= \cdots = \beta_{11} = 0$
\item $H_0: \beta_7=0$
\item $H_0: \beta_0=0$
\end{enumerate}
\item What proportion of the variation in crime rate is explained by the explanatory variables in this model? The answer is a number. % 0.4827
\item What is the smallest value of $e_i$? The answer is a number. % -26.7809
\item What is the largest value of $e_i$? The answer is a number. % 23.0755
\item Look at the output of \texttt{summary}. For the first entry under ``\texttt{t value}'' (that's \texttt{1.502}), what is the null hypothesis? The answer is a symbolic statement involving one or more Greek letters. % H0: beta0=0
\item Look at the $F$ test at the end of the \texttt{summary} output. What is the null hypothesis? The answer is a symbolic statement involving one or more Greek letters. % H0: beta1 = ... = beta11 = 0
\item Controlling for all the other variables in the model, is percent High School graduates related to crime rate?
\begin{enumerate}
\item Give the null hypothesis in symbols.
\item Give the value of the test statistic. The answer is a number from your printout.
\item Give the $p$-value. The answer is a number from your printout.
\item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
\item Allowing for other variables, census tracts with a higher percentage of High School graduates tend to have \underline{~~~~~~~} crime rates.
% higher
\end{enumerate}
\item Controlling for all the other variables in the model, is number of physicians related to crime rate?
\begin{enumerate}
\item Give the null hypothesis in symbols.
\item Give the value of the test statistic. The answer is a number from your printout.
\item Give the $p$-value. The answer is a number from your printout.
\item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
\item Is there enough evidence to conclude that allowing for other variables, number of physicians is related to crime rate?
\end{enumerate}
\item Controlling for all the other variables in the model, is there a difference in crime rate between the Northeast and North Central regions?
\begin{enumerate}
\item Give the null hypothesis in symbols.
\item Give the value of the test statistic. The answer is a number from your printout.
\item Give the $p$-value. The answer is a number from your printout.
\item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
\item State the conclusion in plain, non-statistical language. If there is a difference, say which region has a higher average crime rate!
\end{enumerate}
\item Controlling for all the other variables in the model, is there a difference in crime rate between the Northeast and South regions?
\begin{enumerate}
\item Give the null hypothesis in symbols.
\item Give the value of the test statistic. The answer is a number from your printout.
\item Give the $p$-value. The answer is a number from your printout.
\item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
\item State the conclusion in plain, non-statistical language. If there is a difference, say which region has a higher average crime rate!
\end{enumerate}
\item I think it's remarkable that only one variable apart from region seems to make a difference once you allow for the others. Which one is it?
\item But the other variables may be masking each other's relationship when each is controlled for all the others. Please test them all at once, with a view to maybe dropping them and obtaining a simpler model.
\begin{enumerate}
\item Give the null hypothesis in symbols.
\item Give the value of the test statistic. The answer is a number from your printout.
\item Give the $p$-value. The answer is a number from your printout.
\item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
\item Is there evidence that, once we control for region and percent High School graduates, any of these variables is related to the crime rate?
\end{enumerate}
\item To be continued \ldots
\end{enumerate}
\textbf{Bring your printout to the quiz.}
\end{enumerate}
\end{enumerate}

\vspace{10mm}
\noindent
\begin{center}\begin{tabular}{l} \hspace{6.5in} \\ \hline \end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf14} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf14}}

\end{document}