\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f14 Assignment Nine}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
\textbf{Please bring your printout for Question~\ref{census} to the quiz.} The other questions are just practice for the quiz, and are not to be handed in.

\begin{enumerate}

\item \label{scalar} Look at the general linear model in scalar form on the formula sheet. Suppose that each observation $Y_i$ is actually the mean of $n_i$ independent observations with common variance $\sigma^2$. For example, $Y_i$ could be the average customer satisfaction rating at a bank branch, but different numbers of customers took the survey at each branch.
    \begin{enumerate}
    \item What is $Var(Y_i)$?
    \item Because this is a regression model, let's make this the variance of $\epsilon_i$, so that it's also the variance of $Y_i$. Do the $\epsilon_i$ all have equal variance now?
    \item Multiply both sides of the regression model equation by a constant $c_i$ (different for each $i$), obtaining
    \begin{displaymath}
    Y_i^* = \beta_0^* + \beta_1 x_{i1}^* + \cdots + \beta_k x_{ik}^* + \epsilon_i^*.
    \end{displaymath}
    Call this the ``transformed model.'' Choose the constants $c_i$ so that the variances of all the $\epsilon_i^*$ are equal. What is $c_i$?
    % Notice $c_i = \sqrt{w_i}$
    \item Remember that the least squares problem is to minimize
    \begin{displaymath}
    Q = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_k x_{ik} )^2
    \end{displaymath}
    Write $Q$ for the transformed model in terms of the $n_i$, and simplify (factor something out). This is a simple but very useful version of ``weighted least squares,'' in which some observations get more weight than others in determining $\widehat{\boldsymbol{\beta}}$. What are the ``weights?''
    \end{enumerate}

\item Here is a generalization of Question~\ref{scalar}. In the general linear regression model, let $cov(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{V}$, where $\mathbf{V}$ is a \emph{known} symmetric and positive definite matrix. As usual, $\sigma^2$ is an unknown constant.
    \begin{enumerate}
    \item What is $cov(\mathbf{Y})$ for this unequal variance model?
    \item Multiply both sides of $\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$ by $\mathbf{V}^{-1/2}$, obtaining the ``transformed'' model $\mathbf{Y}^{^*} = \mathbf{X}^{^*} \boldsymbol{\beta} + \boldsymbol{\epsilon}^*$. Notice that $\boldsymbol{\beta}$ is the same for the original (unequal variance) model and the transformed model.
        \begin{enumerate}
        \item What is the variance-covariance matrix of $\boldsymbol{\epsilon}^*$?
        \item What is the matrix $\mathbf{V}$ for Question~\ref{scalar}?
        \end{enumerate}
    \item Write down and simplify a formula for $\widehat{\boldsymbol{\beta}}^*$.
    \item Is $\widehat{\boldsymbol{\beta}}^*$ unbiased given the unequal variance model? Answer Yes or No and show your work.
    \item Is $\widehat{\boldsymbol{\beta}}$ unbiased given the unequal variance model? Answer Yes or No and show your work.
    \item Given the unequal variance model, which has the smaller variance, $\mathbf{a}^\prime\widehat{\boldsymbol{\beta}}^*$, or $\mathbf{a}^\prime\widehat{\boldsymbol{\beta}}$? Why?
    \item If $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{V})$, what is the distribution of $\widehat{\boldsymbol{\beta}}^*$? Show your work.
    \end{enumerate}

\item The Wisconsin Power and Light Company studied the effectiveness of two devices for improving the efficiency of gas home-heating systems. The electric vent damper (EVD) reduces heat loss through the chimney when the furnace is in the off cycle by closing off the vent. It is controlled electrically. The thermally activated vent damper (TVD) is the same as the EVD except it is controlled by the thermal properties of a set of bimetal fins set in the vent. Ninety test houses were randomly assigned to have a free vent damper installed; 40 received EVDs and 50 received TVDs. For each house, energy consumption was measured for a period of several weeks with the vent damper active (``vent damper in'') and for an equal period with the vent damper not active (``vent damper out''). Here are the variables:
    \begin{itemize}
    \item[] House Identification Number
    \item[] Type of furnace (1=Forced air 2=Gravity 3=Forced water 4=Steam)
    \item[] Chimney area
    \item[] Chimney shape (1=Round 2=Square 3=Rectangular)
    \item[] Chimney height in feet
    \item[] Type of chimney liner (0=Unlined 1=Tile 2=Metal)
    \item[] Type of house (1=Ranch 2=Two-story 3=Tri-level 4=Bi-level 5=One and a half stories)
    \item[] House age in years
    \item[] Type of damper (1=EVD 0=TVD)
    \item[] Energy consumption with damper active (in)
    \item[] Energy consumption with damper inactive (out)
    \end{itemize}
Consider a model in which the response variable ($Y$) is average energy consumption with vent damper in and vent damper out, and the explanatory variables are age of house ($X_1$), chimney area ($X_2$) and furnace type (4 categories). There should be no interactions in your model.
    \begin{enumerate}
    \item Write $E[Y|\mathbf{X}]$ for your model. This would be the \emph{full} model for any $F$-test that uses the full versus reduced approach.
    \item Make a table with four rows, one for each type of furnace. Make columns showing how your dummy variables are defined, and include one wider column at the end, showing $E[Y|\mathbf{X}]$ for each furnace type.
    \item You want to test whether, controlling for age of house and chimney area, average energy consumption depends on furnace type.
        \begin{enumerate}
        \item Give the null hypothesis in terms of the $\beta$s.
        \item Give $E[Y|\mathbf{X}]$ for the reduced model.
        \end{enumerate}
    \item You want to test whether, controlling for furnace type and chimney area, average energy consumption depends on age of house.
        \begin{enumerate}
        \item Give the null hypothesis in terms of the $\beta$s.
        \item Give $E[Y|\mathbf{X}]$ for the reduced model.
        \end{enumerate}
    \item You want to test whether, controlling for age of house and chimney area, average energy consumption is different for Forced air furnaces and Gravity furnaces.
        \begin{enumerate}
        \item Give the null hypothesis in terms of the $\beta$s.
        \item Give $E[Y|\mathbf{X}]$ for the reduced model.
        \end{enumerate}
    \item You want to test whether, controlling for age of house and chimney area, average energy consumption is different for Forced air and Forced water furnaces.
        \begin{enumerate}
        \item Give the null hypothesis in terms of the $\beta$s.
        \item Give $E[Y|\mathbf{X}]$ for the reduced model.
        \end{enumerate}
    \item You want to test whether, controlling for age of house and chimney area, average energy consumption for Steam furnaces is different from the average of Forced air and Forced water furnaces. (You are comparing an expected value with the mean of two expected values.)
        \begin{enumerate}
        \item Give the null hypothesis in terms of the $\beta$s.
        \item Give $E[Y|\mathbf{X}]$ for the reduced model.
        \end{enumerate}
    \end{enumerate}

\newpage
\item High School History classes from across Ontario are randomly assigned to either a discovery-oriented or a memory-oriented curriculum in Canadian history.
At the end of the year, the students are given a standardized test and the median score of each class is recorded. Please consider a regression model with these variables:
    \begin{itemize}
    \item[$X_1$] Equals 1 if the class uses the discovery-oriented curriculum, and equals 0 if the class uses the memory-oriented curriculum.
    \item[$X_2$] Average parents' education for the classroom
    \item[$X_3$] Average parents' income for the classroom
    \item[$X_4$] Number of university History courses taken by the teacher
    \item[$X_5$] Teacher's final cumulative university grade point average
    \item[$Y~$] Class median score on the standardized history test.
    \end{itemize}
The full regression model has $E[Y|\mathbf{X}] = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5$. Give $E[Y|\mathbf{X}]$ for the reduced model you would use to answer each of the following questions. Don't re-number the variables. Also, for each question please give the null hypothesis in terms of $\beta$ values.
    \begin{enumerate}
    \item If you control for parents' education and income and for teacher's university background, does curriculum type affect test scores? (And why is it okay to use the word ``affect?'')
    \item Controlling for parents' education and income and for curriculum type, is teacher's university background (two variables) related to their students' test performance?
    \item Controlling for teacher's university background and for curriculum type, are parents' education and income (considered simultaneously) related to students' test performance?
    \item Controlling for curriculum type, teacher's university background and parents' education, is parents' income related to students' test performance?
    \end{enumerate}

\item In a study of recovery from spinal cord injury, patients were randomly assigned to four different physical therapy programmes, which will be called $A$, $B$, $C$ and $D$.
The dependent variable is ``mobility'' (basically how well the patients can move around on their own) after two months, and severity of the initial injury is a covariate. Call the covariate $x$, and call the dummy variables $p_j$ for $j = 1, \ldots, ?$.
    \begin{enumerate}
    \item Write the equation for a regression model that includes the possibility of regression lines that are not parallel.
    \item Make a table with columns showing how the dummy variables are defined. Make $D$ the reference category. Include a wider column in which you show $E(Y|x)$ for each treatment programme.
    \newpage
    \item In terms of the $\beta$ coefficients of your model, what null hypothesis would you test to answer each of the following questions?
        \begin{enumerate}
        \item Are the four regression lines parallel?
        \item Are the slopes for treatments $A$, $B$ and $C$ equal?
        \item Are the slopes for treatments $A$, $B$ and $D$ equal?
        \item Is there an interaction between treatment programme and initial severity of the injury?
        \item Holding initial severity of the injury constant at $x=5$ (the definition of a ``moderate'' injury), do the treatments differ in their effectiveness?
        \item Holding initial severity of the injury constant at $x=5$, which is more effective, treatment $A$ or treatment $C$?
        \end{enumerate}
    \item Write the last three null hypotheses in matrix form as $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$.
    \end{enumerate}

% \pagebreak
\item \label{census} Please return to the Census Tract data again. Fit a regression model in which crime rate is a function of \texttt{area}, \texttt{urban}, \texttt{old}, \texttt{docs}, \texttt{beds}, \texttt{hs}, \texttt{labor}, \texttt{income} and \texttt{region} of the country. There are no interactions for now. This is the \emph{full model} in all the analyses that follow. Just so we will be doing things the same way, please make \texttt{region} a factor, and look at help to see how to use the \texttt{labels=} option.
If you can't remember what the regions are during the quiz, nobody will tell you. Based on this model,
    \begin{enumerate}
    \item What is $k$? The answer is a number. % 11
    \item What is $\widehat{\beta}_4$? The answer is a number. %
    \item Give the test statistic, the degrees of freedom and the $p$-value for each of the following null hypotheses. The answers are numbers from your printout.
        \begin{enumerate}
        \item $H_0: \beta_1=\beta_2= \cdots = \beta_{11} = 0$
        \item $H_0: \beta_7=0$
        \item $H_0: \beta_0=0$
        \end{enumerate}
    \item What proportion of the variation in crime rate is explained by the independent variables in this model? The answer is a number. % 0.4827
    \item What is the smallest value of $\widehat{\epsilon}_i$? The answer is a number. % -26.7809
    \item What is the largest value of $\widehat{\epsilon}_i$? The answer is a number. % 23.0755
    \item Look at the output of \texttt{summary}. For the first entry under ``\texttt{t value}'' (that's \texttt{1.502}), what is the null hypothesis? The answer is a symbolic statement involving one or more Greek letters. % H0: beta0=0
    \item Look at the $F$ test at the end of the \texttt{summary} output. What is the null hypothesis? The answer is a symbolic statement involving one or more Greek letters. % H0: beta1 = ... = beta11 = 0
    \newpage
    \item Controlling for all the other variables in the model, is percent High School graduates related to crime rate?
        \begin{enumerate}
        \item Give the null hypothesis in symbols.
        \item Give the value of the test statistic. The answer is a number from your printout.
        \item Give the $p$-value. The answer is a number from your printout.
        \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
        \item Allowing for other variables, census regions with a higher percentage of High School graduates tend to have \underline{~~~~~~~} crime rates. % higher
        \end{enumerate}
    \item Controlling for all the other variables in the model, is number of physicians related to crime rate?
        \begin{enumerate}
        \item Give the null hypothesis in symbols.
        \item Give the value of the test statistic. The answer is a number from your printout.
        \item Give the $p$-value. The answer is a number from your printout.
        \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
        \item Is there enough evidence to conclude that allowing for other variables, number of physicians is related to crime rate?
        \end{enumerate}
    \item Controlling for all the other variables in the model, is there a difference in crime rate between the Northeast and North Central regions?
        \begin{enumerate}
        \item Give the null hypothesis in symbols.
        \item Give the value of the test statistic. The answer is a number from your printout.
        \item Give the $p$-value. The answer is a number from your printout.
        \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
        \item State the conclusion in plain, non-statistical language. If there is a difference, say which region has a higher average crime rate!
        \end{enumerate}
    \item Controlling for all the other variables in the model, is there a difference in crime rate between the Northeast and South regions?
        \begin{enumerate}
        \item Give the null hypothesis in symbols.
        \item Give the value of the test statistic. The answer is a number from your printout.
        \item Give the $p$-value. The answer is a number from your printout.
        \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
        \item State the conclusion in plain, non-statistical language. If there is a difference, say which region has a higher average crime rate!
        \end{enumerate}
    \newpage
    \item Controlling for all the other variables in the model, is there a difference in crime rate between the South and West regions?
        \begin{enumerate}
        \item Give the null hypothesis in symbols.
        \item Give the value of the test statistic. The answer is a number from your printout.
        \item Give the $p$-value. The answer is a number from your printout.
        \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
        \item State the conclusion in plain, non-statistical language. If there is a difference, say which region has a higher average crime rate!
        \end{enumerate}
    \item I think it's remarkable that only one variable apart from region seems to make a difference once you allow for the others. Which one is it?
    \item But the other variables may be masking each other's relationship when each is controlled for all the others. Please test them all at once, with a view to maybe dropping them and obtaining a simpler model.
        \begin{enumerate}
        \item Give the null hypothesis in symbols.
        \item Give the value of the test statistic. The answer is a number from your printout.
        \item Give the $p$-value. The answer is a number from your printout.
        \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
        \item What proportion of the remaining variation do these variables explain?
        \item Is there evidence that, once we control for region and percent High School graduates, any of these variables is related to the crime rate?
        \end{enumerate}
    \item To be continued \ldots
    \end{enumerate}
\textbf{Bring your printout to the quiz.}

\end{enumerate}

\vspace{40mm}

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline \end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely.
The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f14}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f14}}

\end{document}

\vspace{5mm} \noindent

# R work for STA302f13 Assignment 10

rm(list=ls())
census = read.table("http://www.utstat.toronto.edu/~brunner/302f13/code_n_data/hw/CensusTract.data")
attach(census)
crimerate = crimes/pop
region=factor(region,labels=c("NE","NC","S","W" ))
fullmod = lm(crimerate ~ area + urban + docs + beds + hs + region)
summary(fullmod)
justcovs = lm(crimerate ~ area + urban + docs + beds + hs)
justregion = lm(crimerate ~ region)
# Other matter controlling for region?
anova(justregion,fullmod)
# Region controlling for others
anova(justcovs,fullmod)

# Testing pairwise differences
betahat = fullmod$coefficients; betahat #$
V = vcov(fullmod)
dfe = fullmod$df.residual #$
# t-tests
a23 = rbind(0,0,0,0,0,0,1,-1,0)
a24 = rbind(0,0,0,0,0,0,1,0,-1)
a34 = rbind(0,0,0,0,0,0,0,1,-1)
# NC vs S
T23 = as.numeric( t(a23)%*%betahat/sqrt(t(a23)%*%V%*%a23) )
T23; 2*(1-pt(abs(T23),dfe))
# NC vs W
T24 = as.numeric( t(a24)%*%betahat/sqrt(t(a24)%*%V%*%a24) )
T24; 2*(1-pt(abs(T24),dfe))
# S vs W
T34 = as.numeric( t(a34)%*%betahat/sqrt(t(a34)%*%V%*%a34) )
T34; 2*(1-pt(abs(T34),dfe))