% 302f20Assignment9.tex
\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage{comment}
\usepackage{euscript} % for \EuScript
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f20 Assignment Nine}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/302f20} {\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f20}}} }
\vspace{1 mm}
\end{center}

\noindent
The following problems are not to be handed in. They are preparation for the quiz in tutorial and the final exam. Please try them before looking at the answers. Use the formula sheet. Please remember that the R part (Question~\ref{sales}) is \emph{not a group project}. You may compare numerical answers, but do not show anyone your code or look at anyone else's.

\begin{enumerate}

\item \label{moresat} In an extended version of the SAT data, the dependent (response) variable is first-year university Grade Point Average (GPA) again.
The independent (predictor) variables are
\begin{itemize}
\item[$x_1=$] Verbal SAT score
\item[$x_2=$] Math SAT score
\item[$x_3=$] High school Grade Point Average
\item[$x_4=$] Mother's education, in years
\item[$x_5=$] Father's education, in years
\item[$x_6=$] Total family income,
\end{itemize}
and also Location of the family home: City, Suburbs or Country.
\begin{enumerate}
\item First, write the regression equation. Use indicator dummy variables with an intercept, and make the regression planes parallel.
\item Make a table with one row for each location of the family home, showing how your dummy variables are defined. Make one more column showing $E(y|\mathbf{x})$ for each location. Note that the \emph{symbols} for your dummy variables will not appear in this column. The lecture slides have examples.
\item For each of the following questions, do three things: give the null hypothesis in the form of a statement about the $\beta$ values, give the $\mathbf{C}$ and $\mathbf{t}$ matrices in $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$, and give $E(y|\mathbf{x})$ for the reduced model (note that expected $y$ for the full model is always the same).
    \begin{enumerate}
    \item Correcting for all other variables, is location of the family home related to first-year GPA?
    \item Controlling for all other variables, is either Verbal SAT score or Math SAT score (or both) related to GPA?
    \item When you allow for all the other variables, is family income a useful predictor of GPA?
    \item Controlling for all other variables, does expected GPA change faster as a function of Verbal SAT, or does it change faster as a function of Math SAT? No full versus reduced model for this one.
    \item Once you correct for the two SAT scores and High School marks, do any of the family variables matter?
    \item Correcting for all other variables, does expected GPA change faster as a function of Mother's education, or does it change faster as a function of Father's education?
    No full versus reduced model for this one.
    \item Holding all the other variables constant at fixed values, is Math SAT related to first-year university GPA?
    \item Controlling for the other variables, is average GPA of students from the suburbs different from average GPA of students from the city?
    \item Once you allow for location of the family home, do any of the other predictors matter?
    \end{enumerate}
\item Now consider a model with cell means coding (indicator dummy variables and no intercept). The regression planes are still parallel.
    \begin{enumerate}
    \item Write $E(y|x)$.
    \item Make a table.
    \item What is the null hypothesis you would test to answer this question: Controlling for the other variables, does average GPA differ by location of the family home?
    \item What is the null hypothesis you would test to answer this question: Controlling for the other variables, is average GPA of students from the suburbs different from average GPA of students from the city?
    \end{enumerate}
\end{enumerate} % End of extended SAT question

% This is a good place for 1-1 linear transformations of X.

\item It was suggested in lecture that using a different dummy variable coding scheme is just a linear transformation of the $\mathbf{X}$ matrix: $\mathbf{W} = \mathbf{XA}$, where $\mathbf{A}$ is a $(k+1) \times (k+1)$ matrix with an inverse, and $\mathbf{W}$ is the new $\mathbf{X}$ matrix. Suppose you want to switch from cell means coding to indicators with an intercept. Consider the specific case of a single categorical independent variable with three categories, and a single quantitative independent variable.
Making the last category the reference category, there is a $4 \times 4$ matrix $\mathbf{A}$ such that
\begin{displaymath}
\left(\begin{array}{cccc}
1 & 0 & 0 & x_1 \\
0 & 1 & 0 & x_2 \\
0 & 0 & 1 & x_3 \\
1 & 0 & 0 & x_4 \\
\vdots & \vdots & \vdots & \vdots \\
0 & 1 & 0 & x_n \\
\end{array}\right) \mathbf{A} =
\left(\begin{array}{cccc}
1 & 1 & 0 & x_1 \\
1 & 0 & 1 & x_2 \\
1 & 0 & 0 & x_3 \\
1 & 1 & 0 & x_4 \\
\vdots & \vdots & \vdots & \vdots \\
1 & 0 & 1 & x_n \\
\end{array}\right)
\end{displaymath}
Give the matrix $\mathbf{A}$. It is a matrix of specific numbers.

\item When there is more than one categorical explanatory variable in a regression model, there is no problem if the model has an intercept. But if two categorical variables are represented separately with cell means coding, there is potential trouble. Why?

\item Linear transformations of the $\mathbf{X}$ matrix are not limited to switching dummy variable schemes. In general,
\begin{eqnarray*}
& & \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \\
& \iff & \mathbf{y} = \mathbf{XA A}^{-1} \boldsymbol{\beta} + \boldsymbol{\epsilon} \\
& \iff & \mathbf{y} = \mathbf{W} \boldsymbol{\alpha} + \boldsymbol{\epsilon},
\end{eqnarray*}
where $\mathbf{A}$ is a $(k+1) \times (k+1)$ matrix, $\mathbf{W=XA}$ and $\boldsymbol{\alpha} = \mathbf{A}^{-1} \boldsymbol{\beta}$. There is a new vector of regression coefficients because the \emph{meaning} of the regression coefficients changes when the predictor variables are transformed.
\begin{enumerate}
\item Denoting the least-squares estimate of $\boldsymbol{\alpha}$ by $\widehat{\boldsymbol{\alpha}}$, find a formula for $\widehat{\boldsymbol{\alpha}}$. Simplify. What is its connection to $\widehat{\boldsymbol{\beta}}$?
\item What is the vector of predicted $y$ values for the transformed model? How does it compare to $\widehat{\mathbf{y}}$ from the original model?
\item Give a null hypothesis equivalent to $H_0: \mathbf{C}\boldsymbol{\beta}=\mathbf{t}$, but in terms of the transformed model. It's $H_0: \mathbf{C}_2\boldsymbol{\alpha}=\mathbf{t}$. What is $\mathbf{C}_2$?
\item Compare the $F^*$ statistics for testing $H_0: \mathbf{C}\boldsymbol{\beta}=\mathbf{t}$ and $H_0: \mathbf{C}_2\boldsymbol{\alpha}=\mathbf{t}$. One would hope they are the same. Are they? Show your work.
\end{enumerate}
% A good quiz or exam question is A = (X'X)^{-1/2}

\item You know that if a regression model has an intercept, the residuals add to zero. This yields $SST=SSR+SSE$, and makes $R^2 = \frac{SSR}{SST}$ meaningful.
\begin{enumerate}
\item \label{modRsq} When a regression model does not have an intercept, software authors apparently still feel the need to report an $R^2$. To do this, they partition not $SST = \sum_{i=1}^n(y_i-\overline{y})^2$, but the sum of squared deviations of the $y_i$ around zero: $\sum_{i=1}^n(y_i-0)^2 = \sum_{i=1}^ny_i^2$. For a model that might or might not have an intercept (it doesn't matter),
    \begin{enumerate}
    \item Prove $\sum_{i=1}^ny_i^2 = \mbox{\emph{SSE}} + \sum_{i=1}^n\widehat{y}_i^2$.
    \item Write a formula for the proportion of $\sum_{i=1}^ny_i^2$ that is explained by the regression. This is the alternative definition of $R^2$ used by R and other software when the model does not have an intercept. It is often quite high compared to typical $R^2$ values.
    \end{enumerate}

\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item \label{onecol} It turns out that for some models that do not have intercepts, the residuals still add up to zero. This is attractive because in this case the usual definition of $R^2$ is meaningful, and we are not stuck with the weird modified definition of $R^2$ from Question~\ref{modRsq}. Here is an easy condition to check. Let $\mathbf{j}$ denote an $n \times 1$ column of ones.
Show that if there is a $(k+1) \times 1$ vector of constants $\mathbf{v}$ with $\mathbf{Xv}= \mathbf{j}$, then $\sum_{i=1}^n\widehat{\epsilon}_i=0$. Another way to state this is that if there is a linear combination of the columns of $\mathbf{X}$ that equals a column of ones, then the sum of residuals equals zero. Clearly this applies to a model with a categorical explanatory variable and cell means coding. \end{enumerate} \item \label{censustract} The U.S. Census Bureau divides the United States into small pieces called census tracts; lots of information is collected about each census tract. The census tracts are grouped into four geographic regions: North Central, Northeast, South and West. In one study, the cases were census tracts, the explanatory variables were Region and average income, and the response variable was crime rate, defined as the number of reported serious crimes in a census tract, divided by the number of people in the census tract. \begin{enumerate} \item Write $E(y|x)$ for a regression model with \emph{no intercept} and parallel regression lines. You do not have to say how your dummy variables are defined. You will do that in the next part. \item Make a table showing how your dummy variables are set up. There should be one row for each region, and a column for each dummy variable. Add a wider column on the right, in which you show $E(y|x)$. \item For each of the following questions, give the null hypothesis in terms of the $\beta_j$ parameters of your regression model. We are not doing one-tailed tests, regardless of how the question is phrased. \begin{enumerate} \item Controlling for income, does average crime rate differ by geographic region? \item Controlling for income, is average crime rate different in the North Central and Northeast regions? \item Controlling for income, is average crime rate different in the Northeast and Western regions? 
    \item Controlling for income, is the crime rate in the South more than the average of the other three regions?
    \item Controlling for income, is the average crime rate in the Northeast and North Central regions different from the average of the South and West?
    \item Controlling for geographic region, is crime rate connected to income?
    \end{enumerate}

\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item State why each of the following is a bad way to ask the question.
    \begin{enumerate}
    \item Controlling for income, does geographic region affect the average crime rate?
    \item Allowing for geographic region, does average income have any effect on crime rate?
    \end{enumerate}
\item \label{uneqslopes} Write $E(y|\mathbf{x})$ for a regression model in which the regression lines might not be parallel. This time, use a model with an intercept. Make North Central the reference category; that's what R would do, since it's alphabetically first.
\item Make a table showing how the dummy variables are set up. There should be one row for each region, and a column for each dummy variable. Add a wider column on the right, in which you show $E(y|\mathbf{x})$.
\item For this new model with possibly unequal slopes, give the null hypothesis you would test in order to answer each question. Write it in scalar form, in terms of the $\beta_j$ parameters.
    \begin{enumerate}
    \item Are the four regression lines parallel in the population?
    \item Is there an interaction between average income and geographic region?
    \item Does the relationship of average income to crime rate depend on geographic region?
    \item Do regional differences in average crime rate depend on the average income in the census tract?
    \item Is the slope of the line relating average income to expected crime rate different for the North Central and Northeast regions?
    \item Is the slope of the line relating average income to crime rate different for the North Central and South regions?
    \item Is the slope of the line relating average income to crime rate different for the North Central and West regions?
    \item Is the slope of the line relating average income to crime rate different for the Northeast and South regions?
    \item Is the slope of the line relating average income to crime rate different for the Northeast and West regions?
    \item Is the slope of the line relating average income to crime rate different for the South and West regions?
    \item Is average income related to crime rate for the South region? This is equivalent to asking if the slope of the regression line for the South region is different from zero.
    \item Is average income related to crime rate for the Northeast or South region (or both)? This is one test.
    \end{enumerate}
\end{enumerate} % End of census tract question

\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item The usual advice is that when your regression model contains product terms to represent interactions, you should be sure to include the variables you are multiplying together. Usually, this keeps you out of trouble.
\begin{enumerate}
\item In the Census Tract problem (Question~\ref{censustract}), suppose you have product terms to represent the interaction as in part~\ref{uneqslopes}, but you omit the $x$ variable. Make a table. What is this model saying?
\item Again in the Census Tract problem, suppose you have product terms to represent the interaction as in part~\ref{uneqslopes}, but you omit the dummy variables for geographic region. Make a table. What is this model saying?
\item Staying with the Census Tract problem, suppose you use a model with no intercept and cell means coding. Try representing the interaction with four product terms. Write $E(y|\mathbf{x})$.
    \begin{enumerate}
    \item The regression coefficients of this model cannot possibly be one-to-one with the regression coefficients of the model with product terms and an intercept. Why?
    \item $(\mathbf{X^\prime X})^{-1}$ does not exist. Why?
    \item Try using only three of the product terms. Just leave one out. Make a table. Is this better?
    \item Instead of omitting one of the product terms, omit the $x$ variable (income) from the model, leaving the product terms in. Make a table. Does this work?
    \end{enumerate}
\end{enumerate}

\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item \label{sales} For this problem, you will \emph{not} be asked to upload a file with your complete R input and output. Instead, you'll do the R work during the quiz, and upload just what you did. Of course, if you do the problem below in advance and have your code handy, it will be a lot faster and easier.

Telephone sales representatives use computer software to help them locate potential customers, answer questions, take credit card information and place orders. Twelve sales representatives were randomly assigned to each of three new software packages the company was thinking of purchasing. The data for each sales representative include the software package (1, 2 or 3), sales last quarter with the old software, and sales this quarter with one of the new software packages. Sales are in number of units sold. The data are available \href{http://www.utstat.toronto.edu/~brunner/data/legal/sales.data.txt} {here}. The URL is
\begin{center}
\verb+http://www.utstat.toronto.edu/~brunner/data/legal/sales.data.txt+
\end{center}
The response variable is sales this quarter.
\begin{enumerate}
\item Fit a model in which sales last quarter is ignored. This is very different from controlling for it. We want to know whether software package has any effect on sales. Why is it okay to use the word ``effect''?
    \begin{enumerate}
    \item Write $E(y|\mathbf{x})$.
    \item What proportion of the variation in sales this quarter is explained by software package? The answer is a number from the output of \texttt{summary}.
    \item What is the null hypothesis for testing whether software package has any effect on sales? Give the answer in terms of Greek letters from the regression model.
    \item Give the test statistic. The answer is a number from the output of \texttt{summary}.
    \item Give the $p$-value. The answer is a number from the output of \texttt{summary}. The $p$-value is not the same as the test statistic.
    \item Do you reject $H_0$ at $\alpha=0.05$? Answer Yes or No.
    \item Are the results statistically significant at the 0.05 level? Answer Yes or No.
    \item Give the $p$-value for each pairwise comparison of software packages: That's 1 vs.~2, 1 vs.~3 and 2 vs.~3. Don't bother with a Bonferroni correction.
    \item In plain, non-statistical language, what do you conclude from this analysis?
    \end{enumerate} % End software package only (covariate ignored)

\pagebreak
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item Now fit a model with software package and sales last quarter as the explanatory variables, and sales this quarter as the response variable. There are no interaction terms yet.
    \begin{enumerate}
    \item Write $E(y|\mathbf{x})$. Make sure the variables are in the same order here and in your R program.
    \item What is the null hypothesis for testing whether software package has any effect on sales this quarter once you control for sales last quarter? Give the answer in terms of Greek letters from the regression model.
    \item Give the test statistic. The answer is a number. There is more than one good way to compute it.
    \item Give the $p$-value. The answer is a number.
    \item Do you reject $H_0$ at $\alpha=0.05$? Answer Yes or No.
    \item Are the results statistically significant at the 0.05 level? Answer Yes or No.
    \item What proportion of the \emph{remaining} variation in sales this quarter is explained by software package once you allow for sales last quarter?
    % \item Give the Bonferroni-corrected $p$-value for each pairwise comparison of software packages controlling for sales last quarter: That's 1 vs.~2, 1 vs.~3 and 2 vs.~3.
    \item Give the $p$-value for each pairwise comparison of software packages: That's 1 vs.~2, 1 vs.~3 and 2 vs.~3. Don't bother with a Bonferroni correction.
    \item In plain, non-statistical language, what do you conclude from this analysis?
    \end{enumerate} % End equal slopes model

\item Fit a full model in which the slopes and intercepts of the regression lines relating sales last quarter to sales this quarter might depend on the kind of software the sales representatives are using.
    \begin{enumerate}
    \item Write $E(y|\mathbf{x})$. Make sure the explanatory variables are in the same order here and in your R code.
    \item What is the null hypothesis for testing whether the three slopes are equal? Give the answer in terms of Greek letters from the regression model.
    \item What is the null hypothesis for testing whether the effect of software program on sales this quarter depends on sales last quarter? Give the answer in terms of Greek letters from the regression model.
    \item Carry out an $F$-test to determine whether the effect of software type on sales depends on the representative's performance last quarter. Be able to state your conclusion in plain, non-statistical language.
        \begin{enumerate}
        \item Give the test statistic. The answer is a number.
        \item Give the $p$-value. The answer is a number.
        \item Do you reject $H_0$ at $\alpha=0.05$? Answer Yes or No.
        \item Are the results statistically significant at the 0.05 level? Answer Yes or No.
        \end{enumerate}
    \item Estimate the slopes and intercepts of the three regression lines. You could base the estimates on numbers from \texttt{summary}, but instead please use the \texttt{coefficients} function to obtain $\widehat{\boldsymbol{\beta}}$ with more numerical accuracy. I don't see how you can do this without making a table.
    \item Test whether the slope is different from zero for software package two.
        \begin{enumerate}
        \item State the null hypothesis in Greek letters.
        \item Give the test statistic. The answer is a number.
        \item Give the $p$-value. The answer is a number.
        \item Do you reject $H_0$ at $\alpha=0.05$? Answer Yes or No.
        \item Are the results statistically significant at the 0.05 level? Answer Yes or No.
        \end{enumerate}
% \newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    \item Carry out tests to answer these questions. If they are already on the output of \texttt{summary}, use that.
        \begin{enumerate}
        \item Are the slopes for Software 1 and 2 different? Give the uncorrected $p$-value.
        \item Are the slopes for Software 1 and 3 different? Give the uncorrected $p$-value.
        \item Are the slopes for Software 2 and 3 different? Give the uncorrected $p$-value.
        \end{enumerate}
    Protecting the three tests with a Bonferroni correction at the joint 0.05 significance level, what do you conclude? Plain language is not necessary, but you should say what happened.
    \end{enumerate} % End unequal slopes model
\end{enumerate} % End computer question

% And one polynomial regression question.

\item Here is one last example of a model with a quantitative explanatory variable called $x$, and a categorical explanatory variable with three categories. This time we want a polynomial model, with a potentially different quadratic equation in $x$ for each category of the categorical predictor. Accordingly, we form interaction terms by multiplying both $x$ and $x^2$ by each dummy variable, as follows.
\begin{eqnarray*}
E(y|\mathbf{x}) &=& \beta_0 + \beta_1 d_1 + \beta_2 d_2 + \beta_3x + \beta_4 x^2 \\
& + & \beta_5 d_1x + \beta_6 d_2x + \beta_7 d_1x^2 + \beta_8 d_2x^2
\end{eqnarray*}
\begin{enumerate}
\item Please complete the table below. Collect terms and make it look nice.
The \emph{symbols} $d_1$ and $d_2$ should not appear in your answer, because they are either zero or one.
{\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|c|c|c|c|} \hline
Category & $d_1$ & $d_2$ & $E(y|\mathbf{x})$ \\ \hline
$A$ & 1 & 0 & \hspace{100mm} \\ \hline
$B$ & 0 & 1 & \\ \hline
$C$ & 0 & 0 & \\ \hline
\end{tabular}
\renewcommand{\arraystretch}{1.0}
\end{center}}
\item Suppose you wanted to test whether the three quadratic curves are parallel. What is the null hypothesis?
\end{enumerate} % End of polynomial regression with interactions

\end{enumerate} % End of all the questions

\end{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \vspace{30mm} \hrule \vspace{30mm}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \vspace{30mm}\hrule\vspace{30mm}