\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue,
citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}

\begin{center}
{\Large \textbf{STA 302f13 Assignment Eleven}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
\textbf{Please bring your printout for Question~\ref{census} to the quiz.} The other questions are just practice for the quiz, and are not to be handed in.

\begin{enumerate}

\item \label{centered} For this question, the \emph{uncentered} regression model refers to
\begin{displaymath}
Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i,
\end{displaymath}
and the \emph{centered} regression model refers to
\begin{displaymath}
Y_i = \alpha_0 + \alpha_1 (x_{i1}-\overline{x}_1) + \cdots + \alpha_k (x_{ik}-\overline{x}_k) + \epsilon_i.
\end{displaymath}
\begin{enumerate}
\item Give $\alpha_0, \ldots, \alpha_k$ in terms of $\beta_0, \ldots, \beta_k$.
\item Give $\beta_0, \ldots, \beta_k$ in terms of $\alpha_0, \ldots, \alpha_k$.
\item When fitting the uncentered model by ordinary least squares, the quantity
$Q(\boldsymbol{\beta})=\sum_{i=1}^n(Y_i-\beta_0 - \beta_1 x_{i1} - \cdots - \beta_k x_{ik})^2$
reaches its unique minimum when $\beta_0 = \widehat{\beta}_0, \beta_1 = \widehat{\beta}_1, \ldots, \beta_k = \widehat{\beta}_k$. Show that this same minimum is reached for the centered model when $\alpha_0 = \overline{Y}, \alpha_1 = \widehat{\beta}_1, \ldots, \alpha_k = \widehat{\beta}_k$.
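
\emph{Remark (a sketch, not a full solution):} one way to relate the two parameterizations is to expand the centered model and match terms. Writing
\begin{displaymath}
Y_i = \left( \alpha_0 - \alpha_1 \overline{x}_1 - \cdots - \alpha_k \overline{x}_k \right)
+ \alpha_1 x_{i1} + \cdots + \alpha_k x_{ik} + \epsilon_i
\end{displaymath}
and comparing term by term with the uncentered model suggests $\beta_j = \alpha_j$ for $j = 1, \ldots, k$, while $\beta_0 = \alpha_0 - \sum_{j=1}^k \alpha_j \overline{x}_j$, or equivalently $\alpha_0 = \beta_0 + \sum_{j=1}^k \beta_j \overline{x}_j$.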
\item \label{centerboth} Why is it clear that you could estimate $\beta_1, \ldots, \beta_k$ by centering $Y$ as well as the $X$ variables, and then fitting a regression through the origin?
% \item Verify that the matrix $\frac{1}{n-1} \, \mathbf{X}^\prime_c\mathbf{X}_c$ is a sample variance-covariance matrix. Show some calculations.
\end{enumerate}

\item Consider again the \texttt{furnace} data set described in Assignment 10. The model will have $Y$ = average energy consumption with vent damper in and vent damper out, and the independent variables are age of house ($X_1$), chimney area ($X_2$) and furnace type (4 categories). There should be no interactions in your model, and \emph{this time the covariates $X_1$ and $X_2$ are centered}.
\begin{enumerate}
\item Write $E[Y|\mathbf{X}_c]$ for your model. Of course only $X_1$ and $X_2$ are centered.
\item Make a table with four rows, showing \emph{estimated} expected energy consumption ($\widehat{Y}$) for houses of average (sample mean) age and average (sample mean) chimney area. There is one estimate for each furnace type. Give your answer in terms of $\widehat{\beta}$ values based on your model.
\end{enumerate}

\vspace{40mm} \pagebreak

\item As in Assignment 10, the performance of High School History students is the dependent variable in a regression with the following variables:
\begin{itemize}
\item[$X_1$] Equals 1 if the class uses the discovery-oriented curriculum, and equals 0 if the class uses the memory-oriented curriculum.
\item[$X_2$] Average parents' education for the classroom
\item[$X_3$] Average parents' income for the classroom
\item[$X_4$] Number of university History courses taken by the teacher
\item[$X_5$] Teacher's final cumulative university grade point average
\item[$Y~$] Class median score on the standardized history test.
\end{itemize}
The variables $X_2$ through $X_5$ are centered this time.
\begin{enumerate}
\item Write the equation for a regression model that includes interaction terms allowing the possibility that the two regression planes (one for the discovery-oriented curriculum and one for the memory-oriented curriculum) are not parallel.
\item Make a table with two rows, showing the expected performance for each curriculum type.
% \pagebreak
\item In terms of the $\beta$ coefficients of your model, what null hypothesis would you test to answer each of the following questions?
\begin{enumerate}
\item Are the two regression planes parallel?
\item Holding the covariates constant at their sample mean values, is average performance different for the two curriculum types?
\end{enumerate}
\item Write the above two null hypotheses in matrix form as $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$.
\item In terms of $\widehat{\beta}$ values, give the estimated expected performance for students in classes that are average on $X_2$ through $X_5$. Give one answer for the discovery-oriented curriculum and one for the memory-oriented curriculum.
\end{enumerate}

\pagebreak

% Random IV
\item In the usual univariate multiple regression model, $\mathbf{X}$ is an $n \times (k+1)$ matrix of known constants. But of course in practice, the independent variables are often random, not fixed. Clearly, if the model holds \emph{conditionally} upon the values of the independent variables, then all the usual results hold, again conditionally upon the particular values of the independent variables. The probabilities (for example, $p$-values) are conditional probabilities, and the $F$ statistic does not have an $F$ distribution, but a conditional $F$ distribution, given $\mathbf{X=x}$.
\begin{enumerate}
\item Show that the least-squares estimator $\widehat{\boldsymbol{\beta}}= (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{Y}$ is conditionally unbiased. You've done this before.
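
\emph{Hint (a one-line sketch):} conditionally on $\mathbf{X}=\mathbf{x}$, the estimator is linear in $\mathbf{Y}$, so
\begin{displaymath}
E[\widehat{\boldsymbol{\beta}}\,|\,\mathbf{X}=\mathbf{x}]
= (\mathbf{x}^\prime\mathbf{x})^{-1}\mathbf{x}^\prime \, E[\mathbf{Y}|\mathbf{X}=\mathbf{x}]
= (\mathbf{x}^\prime\mathbf{x})^{-1}\mathbf{x}^\prime\mathbf{x}\boldsymbol{\beta}
= \boldsymbol{\beta}.
\end{displaymath}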
\item Show that $\widehat{\boldsymbol{\beta}}$ is also unbiased unconditionally.
\item A similar calculation applies to the significance level of a hypothesis test. Let $F$ be the test statistic (say for an $F$-test comparing full and reduced models), and $f_c$ be the critical value. If the null hypothesis is true, then the test has size $\alpha$, conditionally upon the independent variable values. That is, $P(F>f_c|\mathbf{X=x})=\alpha$. Find the \emph{unconditional} probability of a Type I error. Assume that the independent variables are discrete, so you can write a multiple sum.
\end{enumerate}

% \pagebreak
\item Consider the following model with random independent variables. Independently for $i=1, \ldots, n$,
\begin{eqnarray*}
Y_i &=& \alpha + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \epsilon_i \\
&=& \alpha + \boldsymbol{\beta}^\prime \mathbf{X}_i + \epsilon_i,
\end{eqnarray*}
where
\begin{displaymath}
\mathbf{X}_i = \left( \begin{array}{c} X_{i1} \\ \vdots \\ X_{ik} \end{array} \right)
\end{displaymath}
and $\mathbf{X}_i$ is independent of $\epsilon_i$. Here, the symbol $\alpha$ is used differently than in Question~\ref{centered}. This time it is the intercept of an uncentered model, and $\boldsymbol{\beta}$ does not include the intercept. The ``independent'' variables $\mathbf{X}_i = (X_{i1}, \ldots, X_{ik})^\prime$ are not statistically independent. They have the symmetric and positive definite $k \times k$ covariance matrix $\boldsymbol{\Sigma}_x = [\sigma_{ij}]$, which need not be diagonal. They also have the $k \times 1$ vector of expected values $\boldsymbol{\mu}_x = (\mu_1, \ldots, \mu_k)^\prime$.
\begin{enumerate}
% \item What is $Cov(X_{i1},Y_i)$? Express your answer in terms of $\beta$ and $\sigma_{ij}$ quantities. Show your work.
\item Let $\boldsymbol{\Sigma}_{xy}$ denote the $k \times 1$ matrix of covariances between $Y_i$ and $X_{ij}$ for $j=1, \ldots, k$.
Calculate $\boldsymbol{\Sigma}_{xy} = C(\mathbf{X}_i,Y_i)$, obtaining $\boldsymbol{\Sigma}_{xy} = \boldsymbol{\Sigma}_x \boldsymbol{\beta}$.
\item Solve the equation above for $\boldsymbol{\beta}$ in terms of $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_{xy}$.
\item Using the expression you just obtained and letting $\widehat{\boldsymbol{\Sigma}}_x$ and $\widehat{\boldsymbol{\Sigma}}_{xy}$ denote matrices of \emph{sample} variances and covariances, what would be a reasonable estimator of $\boldsymbol{\beta}$ that you could calculate from sample data?
\item To see that your ``reasonable'' (Method of Moments) estimator is actually the usual one, first verify that the matrix $\frac{1}{n-1} \, \mathbf{X}^\prime_c\mathbf{X}_c$ is a sample variance-covariance matrix. Show some calculations. What about $\frac{1}{n-1} \, \mathbf{X}^\prime_c\mathbf{Y}_c$?
\item In terms of $\widehat{\boldsymbol{\Sigma}}_x$ and $\widehat{\boldsymbol{\Sigma}}_{xy}$, what is $\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\prime_c \mathbf{X}_c)^{-1} \mathbf{X}^\prime_c \mathbf{Y}_c$?
\end{enumerate}

% \pagebreak
\item \label{census} Please return to the Census Tract data of Assignments Seven and Ten. This time, fit a regression model in which crime rate is a function of just \texttt{docs} and \texttt{region}, but \texttt{docs} is centered and there are interactions in the full model. Remember that for \texttt{region}, 1=Northeast, 2=North Central, 3=South and 4=West. Make Northeast the reference category.
\begin{enumerate}
\item Estimate the expected crime rate for each region when the number of doctors is held constant at the sample mean level. Your answer is a set of four numbers.
\item Carry out tests to answer the following questions. In each case, be able to give the value of the test statistic ($t$ or $F$) and the $p$-value, and state a conclusion in plain, non-technical language --- except for the last one, where the answer is just Yes or No.
\begin{enumerate}
\item For census tracts with an average (sample mean) number of doctors, is there a difference in expected crime rate between the Northeast and West regions?
\item For census tracts with an average (sample mean) number of doctors, is there a difference in expected crime rate between the Northeast and South regions?
\item For census tracts with an average (sample mean) number of doctors, is there a difference in expected crime rate between the North Central and South regions?
\item For census tracts with an average (sample mean) number of doctors, is there a difference in expected crime rate between the North Central and West regions?
\item For census tracts with an average (sample mean) number of doctors, is there a difference in expected crime rate between the South and West regions?
\item Are the regression lines for the Northeast and South regions parallel?
\item Is there evidence that the regression lines for the four regions are not parallel? % There's no overall test for region at x-bar!
\end{enumerate}
\end{enumerate}

\textbf{Bring your R printout to the quiz.}

\end{enumerate}

% \vspace{20mm}
\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline \end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely.
The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f13}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f13}}

\end{document}

# R work for STA302f13 Assignment 11

rm(list=ls())
census = read.table("http://www.utstat.toronto.edu/~brunner/302f13/code_n_data/hw/CensusTract.data")
attach(census)
cdocs = docs-mean(docs)    # Center the docs variable
crimerate = crimes/pop
region = factor(region,labels=c("NE","NC","S","W"))
fullmod = lm(crimerate ~ cdocs + region + cdocs:region) # I didn't show them this easy way.
summary(fullmod)  # With NE as reference and docs centered, the region t-tests
                  # compare NC, S and W to NE at the sample mean number of doctors.

# Remaining pairwise diffs at xbar
betahat = fullmod$coefficients; betahat
dfe = fullmod$df.residual
V = vcov(fullmod)

# NC vs S
a = rbind(0,0,1,-1,0,0,0,0)
T = as.numeric( t(a)%*%betahat/sqrt(t(a)%*%V%*%a) ); T
p = 2*(1-pt(abs(T),dfe)); p

# NC vs W
a = rbind(0,0,1,0,-1,0,0,0)
T = as.numeric( t(a)%*%betahat/sqrt(t(a)%*%V%*%a) ); T
p = 2*(1-pt(abs(T),dfe)); p

# S vs W
a = rbind(0,0,0,1,-1,0,0,0)
T = as.numeric( t(a)%*%betahat/sqrt(t(a)%*%V%*%a) ); T
p = 2*(1-pt(abs(T),dfe)); p

# Test 4 slopes equal
parallel = lm(crimerate ~ cdocs + region)
anova(parallel,fullmod)