\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
%\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\oddsidemargin=0in % Good for US Letter paper
\evensidemargin=0in
\textwidth=6.5in
\topmargin=-0.8in
\headheight=0in
\headsep=0.5in
\textheight=9.4in

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442 Assignment 3}}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf17} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf17}}}
\vspace{1 mm}
\end{center}

\noindent
Except for Question \ref{mathR}, the questions on this assignment are practice for the quiz on Friday September 29th, and are not to be handed in. Please do the problems using the formula sheet as necessary. A copy of the formula sheet will be distributed with the quiz.

\begin{enumerate}

\item In simple regression through the origin, there is one explanatory variable and no intercept. The model is $y_i = \beta_1 x_i + \epsilon_i$.
\begin{enumerate}
\item \label{calc} Find the least squares estimator of $\beta_1$ with calculus.
\item This model is a special case of $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$. What is the $\mathbf{X}$ matrix?
\item What is $\mathbf{X}^\top \mathbf{X}$?
\item What is $\mathbf{X}^\top \mathbf{y}$?
\item What is $(\mathbf{X}^\top \mathbf{X})^{-1}$? \item What is $\widehat{\beta}_1 = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$? Compare this with your answer to~\ref{calc}. \end{enumerate} \item There can even be a regression model with an intercept and no explanatory variables. In this case the model would be $y_i = \beta_0 + \epsilon_i$. Again this is a special case of $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$. \begin{enumerate} \item \label{ybar} Find the least squares estimator of $\beta_0$ with calculus. What's a least-squares estimator again? Find the parameter value(s) that make the $y_i$ observations as close as possible to their expected values. \item What is the $\mathbf{X}$ matrix? \item What is $\mathbf{X}^\top \mathbf{X}$? \item What is $\mathbf{X}^\top \mathbf{y}$? \item What is $(\mathbf{X}^\top \mathbf{X})^{-1}$? \item What is $\widehat{\beta}_0 = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$? Compare this with your answer to~\ref{ybar}. \end{enumerate} \item \label{decomp} The linear regression model with intercept can be written in scalar form as $y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_{p-1} x_{i,p-1} + \epsilon_i$. \begin{enumerate} \item Why does the presence of $\beta_0$ guarantee that the sum of residuals $\sum_{i=1}^ne_i = 0$? \item Defining $SSTO=\sum_{i=1}^n(y_i-\overline{y})^2$, $SSR = \sum_{i=1}^n(\widehat{y}_i-\overline{y})^2$ and $SSE=\sum_{i=1}^n(y_i-\widehat{y}_i)^2$, show $SSTO=SSE+SSR$. I find it helpful to switch to matrix notation partway through the calculation. \end{enumerate} \vspace{5mm} \pagebreak \item \label{crimerate} The U.S. Census Bureau divides the United States into small pieces called census tracts; lots of information is collected about each census tract. The census tracts are grouped into four geographic regions: Northeast, North Central, South and West. 
In one study, the cases were census tracts, the explanatory variables were Region and average income, and the response variable was crime rate, defined as the number of reported serious crimes in a census tract, divided by the number of people in the census tract.
\begin{enumerate}
\item Write $E(Y|x)$ for a regression model with parallel regression lines. Use indicator dummy variables with an intercept. You do not have to say how your dummy variables are defined. You will do that in the next part.
\item Make a table showing how your dummy variables are set up. There should be one row for each region, and a column for each dummy variable. Add a wider column on the right, in which you show $E(Y|x)$ for each region. Note that the \emph{symbols} for your dummy variables will not appear in this column. There are examples of this format in the lecture slides and the text.
\item For each of the following questions, give the null hypothesis in terms of the $\beta$ parameters of your regression model. We are not doing one-tailed tests, regardless of how the question is phrased.
\begin{enumerate}
\item Controlling for income, does average crime rate differ by geographic region?
\item Allowing for income, is average crime rate different in the Northeast and North Central regions?
\item Correcting for income, is average crime rate different in the Northeast and Western regions?
\item Holding income constant, is the crime rate in the South more than the average of the other three regions?
\item For a given fixed value of income, is the average crime rate in the Northeast and North Central regions different from the average of the South and West?
\item Controlling for geographic region, is crime rate connected to income?
\end{enumerate}
\end{enumerate}

\item Now please answer Question~\ref{crimerate} again using \emph{cell means coding}. That's the dummy variable scheme with an indicator for each category and no intercept. Answer all parts of the question.
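
As a reminder of the format only, here is a sketch for a hypothetical explanatory variable with three categories labelled A, B and C, and no covariate. The labels and the symbols $d_1, d_2, d_3$ are assumptions made for illustration; they are not part of this question. With cell means coding, $E(Y|x) = \beta_1 d_1 + \beta_2 d_2 + \beta_3 d_3$, and the table is
\begin{center}
\begin{tabular}{|c|ccc|l|} \hline
Category & $d_1$ & $d_2$ & $d_3$ & $E(Y|x)$ \\ \hline
A & 1 & 0 & 0 & $\beta_1$ \\
B & 0 & 1 & 0 & $\beta_2$ \\
C & 0 & 0 & 1 & $\beta_3$ \\ \hline
\end{tabular}
\end{center}
In this small example each regression coefficient is a category mean, which is where the name comes from.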
\item I know you did this already, but show that $\mathbf{X}^\top \mathbf{e} = \mathbf{0}$.

\item The preceding problem implies that if a regression model has an intercept, the residuals add to zero. In Question~\ref{decomp}, this was critical to showing $SSTO=SSE+SSR$, so that $R^2 = \frac{SSR}{SSTO}$ makes sense. What about a regression model with cell means coding (and maybe some covariates) and no intercept? Do the residuals still add to zero so that $R^2$ is meaningful? Show that if a linear combination of the columns of the $\mathbf{X}$ matrix equals a column of ones (true for cell means coding), the sum of residuals is zero. Denote the $n \times 1$ column of ones by $\mathbf{j}$, and assume there is some $p \times 1$ vector $\mathbf{a}$ with $\mathbf{Xa}=\mathbf{j}$. Start by writing $\sum_{i=1}^ne_i$ in matrix notation.

\pagebreak
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item \label{mathR} Before the beginning of the Fall term, students in a first-year Calculus class took a diagnostic test with two parts: Pre-calculus and Calculus. Data are in the file \href{http://www.utstat.toronto.edu/~brunner/data/legal/math1.data.txt} {\texttt{math1.data.txt}}. The variables are
\begin{itemize}
\item Identification code
\item Course: 1=Catch-up, 2=Mainstream, 3=Elite, 4=NoResponse
\item Score on pre-calculus part of diagnostic test
\item Score on calculus part of diagnostic test
\item High School GPA
\item High School Calculus mark
\item High School English mark
\item University Calculus mark
\item First language
\item Sex
\end{itemize}
These are real data, with some rough edges. Fix them up your way for now. I strongly advise against manually editing the data file. Sooner rather than later, I will post the list of fixes we will all use. You will probably have to re-run your analysis, so of course save the code. Please start by fitting a full model with all the potential explanatory variables.
\begin{enumerate}
\item Make a table showing the dummy variable coding scheme for course (that is, the one R is using by default).
\item What proportion of the variation in university calculus mark is explained by the variables in this model? The answer is a number from your printout.
\item An $F$-test appears in the last line of output from \texttt{summary}. In symbols, what null hypothesis is being tested?
\item What was the original sample size? How many cases are being used to fit the full model?
\item For each statistically significant $t$-test produced by \texttt{summary} (that is, $H_0$ is rejected at $\alpha = 0.05$), state a conclusion in plain, non-statistical language. The statements would begin with something like ``Allowing for other variables, \ldots'' I think the results for \texttt{hsengl} and \texttt{frstlangOther} are interesting.
\item Carry out a test of \texttt{course} controlling for other variables. Be ready to give the value of $F$, the degrees of freedom and the $p$-value. What, if anything, do you conclude?
\end{enumerate}
\textbf{Please bring your printout to the quiz}.

\end{enumerate}

\end{document}

Random explanatory variable question(s) after random vectors.
dummy variable coding schemes are "equivalent"

\item
\begin{enumerate}
\item
\item
\end{enumerate}