\documentclass[10pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}

\begin{center}
{\Large \textbf{STA 2101/442 Assignment Six}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
The questions are just practice for the quiz, and are not to be handed in. Use R for Question~\ref{R}, and bring your printout to the quiz. \textbf{Your printout should show \emph{all} R input and output, and \emph{only} R input and output}. Do not write anything on your printouts except your name and student number.

\vspace{1mm}

\begin{enumerate}

\item \label{tsq} Suppose you wish to test the null hypothesis that a \emph{single} linear combination of regression coefficients is equal to zero. You can test either $H_0: \mathbf{a}^\top\boldsymbol{\beta} = 0$ with a two-sided $t$-test, or $H_0: \mathbf{L}\boldsymbol{\beta} = 0$ with an $F$-test. Referring to the formula sheet, verify that $F=t^2$. Show your work.

\item \label{rowequiv} The exact way that you express a linear null hypothesis does not matter. Let $\mathbf{A}$ be an $r \times r$ nonsingular matrix (meaning $\mathbf{A}^{-1}$ exists), so that $\mathbf{L}\boldsymbol{\beta} = \mathbf{h}$ if and only if $\mathbf{AL}\boldsymbol{\beta} = \mathbf{Ah}$. This is a useful way to express a logically equivalent null hypothesis, because any matrix that is row equivalent to $\mathbf{L}$ can be written as $\mathbf{AL}$. Show that the general linear test statistic $F$ for testing $H_0: (\mathbf{AL})\boldsymbol{\beta} = \mathbf{Ah}$ is the same as the one for testing $H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{h}$.
Use the fact that if the inverses exist, the inverse of a matrix product is the product of the inverses, in reverse order.

\item \label{transformX} Also, all those dummy variable coding schemes are equivalent. Let $\mathbf{A}$ be a $p \times p$ nonsingular matrix (it's a different $\mathbf{A}$ from the one in Question~\ref{rowequiv}). Note that $\mathbf{X}^*=\mathbf{XA}$ is a one-to-one linear transformation of the explanatory variables, and
\begin{displaymath}
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} ~ \Leftrightarrow ~ \mathbf{y} = \mathbf{XA} \, \mathbf{A}^{-1}\boldsymbol{\beta} + \boldsymbol{\epsilon} = \mathbf{X}^* \boldsymbol{\beta}^* + \boldsymbol{\epsilon}.
\end{displaymath}
This is already interesting, because it shows how transforming the explanatory variables changes the meaning of the regression coefficients. Refer to $\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$ as the ``original'' model, and $\mathbf{y} = \mathbf{X}^* \boldsymbol{\beta}^* + \boldsymbol{\epsilon}$ as the ``transformed'' model.
\begin{enumerate}
\item Just to make this more concrete, suppose you have a 3-category explanatory variable and a quantitative covariate: $Y_i = \beta_0 + \beta_1d_{i,1} + \beta_2d_{i,2} + \beta_3x_{i} + \epsilon_i$, where $d_{i,1}$ and $d_{i,2}$ are indicator dummy variables for the first two groups. You want to switch to cell means coding, so that $Y_i = \beta^*_1g_{i,1} + \beta^*_2g_{i,2} + \beta^*_3g_{i,3} + \beta^*_4x_{i} + \epsilon_i$. Note that $\beta^*_4 = \beta_3$. Give the matrix $\mathbf{A}$; you can make tables if that helps.
\item Write down the least squares estimate $\widehat{\boldsymbol{\beta}}^*$ for the transformed model, and simplify. How is $\widehat{\boldsymbol{\beta}}^*$ related to $\widehat{\boldsymbol{\beta}}$?
\item Compare the vector of predicted values from the two models.
\item Compare the vector of residuals from the two models.
\item Which is greater, $SSE$ or $SSE^*$?
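If you get stuck on parts (b) through (e), here is a sketch of the key computation, using the reverse-order rule for inverses from Question~\ref{rowequiv}. Justifying each step, and reading off the consequences for the predicted values, residuals and $SSE^*$, is left to you.

```latex
\begin{align*}
\widehat{\boldsymbol{\beta}}^*
  &= (\mathbf{X}^{*\top}\mathbf{X}^*)^{-1}\mathbf{X}^{*\top}\mathbf{y} \\
  &= (\mathbf{A}^\top\mathbf{X}^\top\mathbf{X}\mathbf{A})^{-1}
      \mathbf{A}^\top\mathbf{X}^\top\mathbf{y} \\
  &= \mathbf{A}^{-1}(\mathbf{X}^\top\mathbf{X})^{-1}
      (\mathbf{A}^\top)^{-1}\mathbf{A}^\top\mathbf{X}^\top\mathbf{y} \\
  &= \mathbf{A}^{-1}\widehat{\boldsymbol{\beta}},
\end{align*}
so that
$\widehat{\mathbf{y}}^* = \mathbf{X}^*\widehat{\boldsymbol{\beta}}^*
 = \mathbf{XA}\,\mathbf{A}^{-1}\widehat{\boldsymbol{\beta}}
 = \widehat{\mathbf{y}}$.
```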
\item Suppose you want to test $H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{h}$. Give the equivalent null hypothesis for the transformed model. That is, what are $\mathbf{L}^*$, $\boldsymbol{\beta}^*$ and $\mathbf{h}^*$ in $H_0: \mathbf{L}^*\boldsymbol{\beta}^* = \mathbf{h}^*$?
\item Compare the $F$ statistic for $H_0: \mathbf{L}^*\boldsymbol{\beta}^* = \mathbf{h}^*$ to the $F$ statistic for $H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{h}$.
\end{enumerate}

\item \label{onecol} Question \ref{transformX} suggests that if a regression model with no intercept is equivalent to one with an intercept, then the residuals will add to zero. This is good to know, because it means $SST=SSR+SSE$, and $R^2$ is meaningful; so is $a$, the proportion of remaining variation. Here is an easy condition to check. Let $\mathbf{1}$ denote an $n \times 1$ column of ones. Show that if there is a $p \times 1$ vector of constants $\mathbf{v}$ with $\mathbf{Xv}= \mathbf{1}$, then $\sum_{i=1}^ne_i=0$. (Another way to state this is that if some linear combination of the columns of $\mathbf{X}$ equals a column of ones, then the sum of the residuals equals zero. Clearly this applies to a model with cell means coding.)

\pagebreak
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item \label{R} The data file \href{http://www.utstat.toronto.edu/~brunner/data/illegal/CensusTract.data.txt}{\texttt{CensusTract.data.txt}} comes from \emph{Applied Linear Statistical Models} (1996) by Neter et al. The data are used here without permission. You can get the data with
{\footnotesize
\begin{verbatim}
census = read.table("http://www.utstat.toronto.edu/~brunner/data/illegal/CensusTract.data.txt")
\end{verbatim}
} % End size
The cases (there are $n$ cases) are a sample of census tracts in the United States. For each census tract, the following variables are recorded.
% \vspace{5mm}
\begin{tabular}{ll}
\texttt{area}   & Land area in square miles \\
\texttt{pop}    & Population in thousands \\
\texttt{urban}  & Percent of population in cities \\
\texttt{old}    & Percent of population 65 or older \\
\texttt{docs}   & Number of active physicians \\
\texttt{beds}   & Number of hospital beds \\
\texttt{hs}     & Percent of population 25 or older completing 12+ years of school \\
\texttt{labor}  & Number of persons 16+ employed or looking for work \\
\texttt{income} & Total before-tax income in millions of dollars \\
\texttt{crimes} & Total number of serious crimes reported by police \\
\texttt{region} & Region of the country: 1=Northeast, 2=North Central, 3=South, 4=West \\
\end{tabular}

\begin{enumerate}
\item First, use R to fit a regression model with \texttt{crimes} as the response variable and just one explanatory variable: \texttt{pop}.
\begin{enumerate}
\item In plain, non-statistical language, what do you conclude from this analysis? The answer is something about population size and number of crimes.
\item What proportion of the variation in number of crimes is explained by population size? The answer is a number between zero and one.
\end{enumerate}
\item Based on that last analysis, we will create a new response variable called crime \emph{rate}, defined as the number of crimes divided by population size. Now fit a new regression model in which crime rate is a function of \texttt{area}, \texttt{urban}, \texttt{old}, \texttt{docs}, \texttt{beds}, \texttt{hs}, \texttt{labor}, \texttt{income} and \texttt{region} of the country. There are no interactions for now. This is the \emph{full model} in all the analyses that follow. Just so we will be doing things the same way, please make \texttt{region} a factor, and look at the help to see how to use the \texttt{labels=} option. It really helps. Based on this model,
\begin{enumerate}
\item What is $p$? The answer is a number. % 11
\item What is $\widehat{\beta}_4$? The answer is a number.
%
\item Give the test statistic, the degrees of freedom and the $p$-value for each of the following null hypotheses. The answers are numbers from your printout.
\begin{enumerate}
\item $H_0: \beta_1=\beta_2= \cdots = \beta_{11} = 0$
\item $H_0: \beta_7=0$
\item $H_0: \beta_0=0$
\end{enumerate}
\item What proportion of the variation in crime rate is explained by the explanatory variables in this model? The answer is a number. % 0.4827
\item What is the smallest value of $e_i$? The answer is a number. % -26.7809
\item What is the largest value of $e_i$? The answer is a number. % 23.0755
\item Look at the output of \texttt{summary}. For the first entry under ``\texttt{t value}'' (that's \texttt{1.502}), what is the null hypothesis? The answer is a symbolic statement involving one or more Greek letters. % H0: beta0=0
\item Look at the $F$ test at the end of the \texttt{summary} output. What is the null hypothesis? The answer is a symbolic statement involving one or more Greek letters. % H0: beta1 = ... = beta11 = 0
\pagebreak
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item Controlling for all the other variables in the model, is percent High School graduates related to crime rate?
\begin{enumerate}
\item Give the null hypothesis in symbols.
\item Give the value of the test statistic. The answer is a number from your printout.
\item Give the $p$-value. The answer is a number from your printout.
\item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
\item Allowing for other variables, census regions with a higher percentage of High School graduates tend to have \underline{~~~~~~~} crime rates. % higher
\end{enumerate}
\item Controlling for all the other variables in the model, is number of physicians related to crime rate?
\begin{enumerate}
\item Give the null hypothesis in symbols.
\item Give the value of the test statistic. The answer is a number from your printout.
\item Give the $p$-value.
The answer is a number from your printout. \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No. \item Is there enough evidence to conclude that allowing for other variables, number of physicians is related to crime rate? \end{enumerate} \item Controlling for all the other variables in the model, is there a difference in crime rate between the Northeast and North Central regions? \begin{enumerate} \item Give the null hypothesis in symbols. \item Give the value of the test statistic. The answer is a number from your printout. \item Give the $p$-value. The answer is a number from your printout. \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No. \item State the conclusion in plain, non-statistical language. If there is a difference, say which region has a higher average crime rate! \end{enumerate} \item Controlling for all the other variables in the model, is there a difference in crime rate between the Northeast and South regions? \begin{enumerate} \item Give the null hypothesis in symbols. \item Give the value of the test statistic. The answer is a number from your printout. \item Give the $p$-value. The answer is a number from your printout. \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No. \item State the conclusion in plain, non-statistical language. If there is a difference, say which region has a higher average crime rate! \end{enumerate} \item I think it's remarkable that only one variable apart from region seems to make a difference once you allow for the others. Which one is it? \item But the other variables may be masking one another's relationship to the response variable when each one is controlled for all the others. Please test them all at once, with a view to maybe dropping them and obtaining a simpler model. \begin{enumerate} \item Give the null hypothesis in symbols. \item Give the value of the test statistic. The answer is a number from your printout. 
\item Give the $p$-value. The answer is a number from your printout.
\item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
\item Is there evidence that, once we control for region and percent High School graduates, any of these variables is related to the crime rate?
\end{enumerate}
\item To be continued \ldots
\end{enumerate}
\end{enumerate}

Please bring your printout to the quiz. \textbf{Your printout should show \emph{all} R input and output, and \emph{only} R input and output}. Do not write anything on your printouts except your name and student number.

\pagebreak
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%% Back to ideas

\item Independently for $i = 1, \ldots, n$, let $Y_i = \beta X_i + \epsilon_i$, where $X_i \sim N(\mu,\sigma^2_x)$ and $\epsilon_i \sim N(0,\sigma^2_\epsilon)$. Because of omitted variables that influence both $X_i$ and $Y_i$, we have $Cov(X_i,\epsilon_i) = c \neq 0$.
\begin{enumerate}
\item The least squares estimator of $\beta$ is $\frac{\sum_{i=1}^n X_iY_i}{\sum_{i=1}^n X_i^2}$. Is this estimator consistent? Answer Yes or No and prove your answer.
\item Give the parameter space for this model. There are some constraints on $c$.
\item First consider points in the parameter space where $\mu \neq 0$. Give an estimator of $\beta$ that converges almost surely to the right answer for that part of the parameter space. If you are not sure how to proceed, try calculating the expected value and covariance matrix of $(X_i,Y_i)$.
\item What happens in the rest of the parameter space --- that is, where $\mu=0$? Is a consistent estimator possible there? So we see that parameters may be identifiable in some parts of the parameter space but not all.
\end{enumerate}
\item \label{poisson} % See 2013 Assignment 6 for data, more detail.
Men and women are calling a technical support line according to independent Poisson processes with rates $\lambda_1$ and $\lambda_2$ per hour.
Data for 144 hours are available, but unfortunately the sex of the caller was not recorded. All we have is the number of callers for each hour, which is distributed Poisson($\lambda_1+\lambda_2$). The parameter in this problem is $\boldsymbol{\theta} = (\lambda_1,\lambda_2)$. Try to find the MLE analytically. Show your work. Are there any points in the parameter space where both partial derivatives are zero? Why did estimation fail for this fairly realistic model?

\item \label{corr} Show that for a simple regression with an intercept and one explanatory variable, $R^2=r^2$, where
$r = \frac{\sum_{i=1}^n (x_i-\overline{x})(Y_i-\overline{Y})}
          {\sqrt{\sum_{i=1}^n (x_i-\overline{x})^2} \sqrt{\sum_{i=1}^n (Y_i-\overline{Y})^2}}$
is the ordinary sample correlation coefficient. You may also use the formulas
$\widehat{\beta}_1 = \frac{\sum_{i=1}^n (x_i-\overline{x})(Y_i-\overline{Y})}
                          {\sum_{i=1}^n (x_i-\overline{x})^2}$ and
$\widehat{\beta}_0 = \overline{Y} - \widehat{\beta}_1\overline{x}$.
It helps to start with the formula for $R^2$, and then substitute for $\widehat{Y}_i$ right away.

\item We know that omitted explanatory variables are a big problem, because they induce non-zero covariance between the explanatory variables and the error terms $\epsilon_i$. The residuals have a lot in common with the $\epsilon_i$ terms in a regression model, though they are not the same thing. In class, somebody suggested checking for correlation between the explanatory variables and the $\epsilon_i$ values by looking at the correlation between the residuals and the explanatory variables. Accordingly, for a multiple regression model satisfying the condition of Question~\ref{onecol} % or with an intercept
(so that $\sum_{i=1}^ne_i=0$), calculate the sample correlation $r$ between explanatory variable $j$ and the residuals $e_1, \ldots, e_n$. The final answer is a number. Does this suggest what \emph{plots} of the residuals versus explanatory variables should look like if the model is okay?
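The zero-correlation fact in the last item can also be checked numerically. Here is a minimal sketch, in Python with simulated, made-up data (not the R work required for Question~\ref{R}), assuming only a least squares fit whose design matrix contains a column of ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: n cases, two explanatory variables (purely illustrative).
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(size=n)

# Design matrix with a column of ones, so the residuals sum to zero.
X = np.column_stack([np.ones(n), x1, x2])
betahat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ betahat  # residuals

# The residuals sum to zero and are uncorrelated with each explanatory variable.
print(abs(e.sum()) < 1e-8)                    # prints True
print(abs(np.corrcoef(x1, e)[0, 1]) < 1e-8)   # prints True
print(abs(np.corrcoef(x2, e)[0, 1]) < 1e-8)   # prints True
```

In R, the analogous check for a fitted model object would be \texttt{cor(x1, resid(fullmodel))}, where \texttt{fullmodel} is a hypothetical name for the fitted model.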
\item Again for a model for which $\sum_{i=1}^ne_i=0$, calculate the sample correlation between the residuals and the predicted values $\widehat{Y}_i$. Does this suggest what a \emph{plot} of the residuals versus predicted values should look like if the model is okay?

\item Still for a model for which $\sum_{i=1}^ne_i=0$, show that the squared correlation between the predicted and observed response variable values is equal to $R^2$. Hint: As preparation, verify that $\mathbf{y}^\top\widehat{\mathbf{y}} = \widehat{\mathbf{y}}^\top\widehat{\mathbf{y}}$. Thus, a scatterplot of $Y$ versus $\widehat{Y}$ gives a picture of how well the explanatory variables are doing their job. How do you know that the correlation is always non-negative?

\end{enumerate}

% \vspace{50mm}

\noindent
\begin{center}\begin{tabular}{l}
\hspace{6in} \\ \hline
\end{tabular}\end{center}

This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf16}{\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf16}}

\end{document}