\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f17 Assignment Ten}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent These questions are preparation for the quiz in tutorial on Thursday November 30th, and are not to be handed in.

\begin{enumerate}

%\item \label{xbarybar} For a regression model with an intercept, show that $\overline{y} = b_0 + b_1 \overline{x}_1 + \cdots + b_k \overline{x}_k$. % Done later, more quietly. Maybe quiz.

\item \label{centered} This question explores the practice of centering quantitative independent variables in a regression by subtracting off the mean. Geometrically, this should not alter the configuration of data points in the multi-dimensional scatterplot. All it does is shift the axes. Thus, the intercept of the least squares plane should change, but the slopes should not. A small illustration follows the parts below.

Consider a simple experimental study with an experimental group, a control group and a single quantitative covariate. Independently for $i=1, \ldots, n$, let
\begin{displaymath}
y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \epsilon_i,
\end{displaymath}
where $x_i$ is the covariate and $d_i$ is an indicator dummy variable for the experimental group. If the covariate is ``centered,'' the model can be written
\begin{displaymath}
y_i = \beta_0^* + \beta_1^* (x_i-\overline{x}) + \beta_2^* d_i + \epsilon_i,
\end{displaymath}
where $\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i$.
\begin{enumerate}
\item Express the $\beta^*$ quantities in terms of the original $\beta$ quantities.
\item \label{general} Let's generalize this. For the general linear model in matrix form, suppose $\boldsymbol{\beta}^* = A\boldsymbol{\beta}$, where $A$ is a square matrix with an inverse. This makes $\boldsymbol{\beta}^*$ a one-to-one function of $\boldsymbol{\beta}$. Of course $X$ is affected as well: writing $X\boldsymbol{\beta} = XA^{-1}A\boldsymbol{\beta}$ shows that the re-parameterized model has design matrix $X^* = XA^{-1}$. Show that $\mathbf{b}^* = A\mathbf{b}$.
\item Give the matrix $A$ for this $k=2$ model.
\item If the independent variable $x$ is centered, what is $E(y|x)$ for the experimental group, and what is $E(y|x)$ for the control group? Give your answer in terms of the $\beta^*$ values of the centered model.
\item In terms of the $\beta^*$ values of the centered model, give $E(y|x)$ for the experimental group and the control group when $x$ equals the average (sample mean) value.
\end{enumerate}
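\vspace{2mm}
\noindent \emph{Illustration:} Here is a sketch for the simplest case, with no dummy variable; it is not one of the parts above, and the algebra for the full model is left to you. If $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, adding and subtracting $\beta_1 \overline{x}$ gives
\begin{displaymath}
y_i = (\beta_0 + \beta_1 \overline{x}) + \beta_1 (x_i - \overline{x}) + \epsilon_i,
\end{displaymath}
so that $\beta_0^* = \beta_0 + \beta_1 \overline{x}$ while $\beta_1^* = \beta_1$. This matches the geometric picture: centering shifts the axes, changing the intercept but not the slope.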
\pagebreak

\item In the following model, there are $k$ quantitative independent variables. The un-centered version is
\begin{displaymath}
y_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{k} x_{i,k} + \epsilon_i,
\end{displaymath}
and the centered version is
\begin{displaymath}
y_i = \beta_0^* + \beta_1^* (x_{i,1}-\overline{x}_1) + \ldots + \beta_{k}^* (x_{i,k}-\overline{x}_{k}) + \epsilon_i,
\end{displaymath}
where $\overline{x}_j = \frac{1}{n}\sum_{i=1}^n x_{i,j}$ for $j = 1, \ldots, k$.
\begin{enumerate}
\item What is $\beta_0^*$ in terms of the $\beta$ quantities? Show your work.
\item In terms of the $\beta$ quantities, what is $\beta_j^*$ for $j=1, \ldots, k$?
\item What is $b_0^*$ in terms of the $b$ quantities? Note that Problem~\ref{general} lets you just write this down.
\item In terms of the $b$ quantities, what is $b_j^*$ for $j=1, \ldots, k$?
\item Referring again to Problem~\ref{general}, give the $A$ matrix for this $k$-variable model.
\item Using $\sum_{i=1}^n\widehat{y}_i = \sum_{i=1}^n y_i$, show that $b_0^* = \overline{y}$.
\end{enumerate}

% Random IV
\item In the usual multiple regression model, the $X$ matrix is an $n \times (k+1)$ matrix of known constants. But in practice, the independent variables are often random, not fixed. Clearly, if the model holds \emph{conditionally} upon the values of the independent variables, then all the usual results hold, again conditionally upon the particular values of the independent variables. The probabilities (for example, $p$-values) are conditional probabilities, and the $F$ statistic does not have an $F$ distribution, but rather a conditional $F$ distribution given $\mathcal{X}=X$. Here, the $n \times (k+1)$ matrix $\mathcal{X}$ is used to denote the matrix containing the random independent variables. It does not have to be \emph{all} random. For example, the first column might contain only ones if the model has an intercept.
\begin{enumerate}
\item Show that the least-squares estimator $(X^\prime X)^{-1} X^\prime \mathbf{y}$ is conditionally unbiased. You've done this before.
\item Show that $\mathbf{b} = (\mathcal{X}^\prime\mathcal{X})^{-1} \mathcal{X}^\prime \mathbf{y}$ is also unbiased unconditionally.
\item A similar calculation applies to the significance level of a hypothesis test. Let $F$ be the test statistic (say, for an $F$-test comparing full and reduced models), and let $f_c$ be the critical value. If the null hypothesis is true, then the test has size $\alpha$ conditionally upon the independent variable values. That is, $P(F>f_c|\mathcal{X}=X)=\alpha$. Using the Law of Total Probability (see lecture slides, and the reminder below), find the \emph{unconditional} probability of a Type I error. Assume that the independent variables are discrete, so you can write a multiple sum.
\end{enumerate}
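\vspace{2mm}
\noindent \emph{Reminder:} This sketch just restates standard tools; the calculations themselves are left to you. For part (b), condition and then average, using double expectation: $E\{\mathbf{b}\} = E\left\{ E[\mathbf{b}|\mathcal{X}] \right\}$. For part (c), the Law of Total Probability for a discrete $\mathcal{X}$ says
\begin{displaymath}
P(F>f_c) = \sum_X P(F>f_c|\mathcal{X}=X) \, P(\mathcal{X}=X),
\end{displaymath}
where the sum runs over the possible values $X$ of the random matrix $\mathcal{X}$.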
\pagebreak

\item Consider the following model with random independent variables. Independently for $i=1, \ldots, n$,
\begin{eqnarray*}
y_i &=& \alpha + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i \\
&=& \alpha + \boldsymbol{\beta}^\prime \mathbf{x}_i + \epsilon_i,
\end{eqnarray*}
where
\begin{displaymath}
\mathbf{x}_i = \left( \begin{array}{c} x_{i1} \\ \vdots \\ x_{ik} \end{array} \right)
\end{displaymath}
and $\mathbf{x}_i$ is independent of $\epsilon_i$. Note that in this notation, $\alpha$ is the intercept, and $\boldsymbol{\beta}$ does not include the intercept. The ``independent'' variables $\mathbf{x}_i = (x_{i1}, \ldots, x_{ik})^\prime$ are not statistically independent. They have the symmetric and positive definite $k \times k$ covariance matrix $\Sigma_x = [\sigma_{ij}]$, which need not be diagonal. They also have the $k \times 1$ vector of expected values $\boldsymbol{\mu}_x = (\mu_1, \ldots, \mu_k)^\prime$.
\begin{enumerate}
% \item What is $Cov(x_{i1},x_i)$? Express your answer in terms of $\beta$ and $\sigma_{ij}$ quantities. Show your work.
\item Let $\Sigma_{xy}$ denote the $k \times 1$ matrix of covariances between $y_i$ and $x_{ij}$ for $j=1, \ldots, k$. Calculate $\Sigma_{xy} = cov(\mathbf{x}_i,y_i)$. Stay with matrix notation and don't expand. %, obtaining $\boldsymbol{\Sigma}_{xy} = \boldsymbol{\Sigma}_x \boldsymbol{\beta}$.
\item From the equation you just obtained, solve for $\boldsymbol{\beta}$ in terms of $\Sigma_x$ and $\Sigma_{xy}$.
\item Based on your answer to the last part, and letting $\widehat{\Sigma}_x$ and $\widehat{\Sigma}_{xy}$ denote matrices of \emph{sample} variances and covariances, what would be a reasonable estimator of $\boldsymbol{\beta}$ that you could calculate from sample data? If you are not sure, check the lecture notes in which we centered $y_i$ as well as the independent variables, and fit a regression through the origin.
\end{enumerate}

\pagebreak

\item In the following regression model, the independent variables $x_1$ and $x_2$ are random variables. The true model is
\begin{displaymath}
y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i,
\end{displaymath}
independently for $i = 1, \ldots, n$, where $\epsilon_i \sim N(0,\sigma^2)$. The mean and covariance matrix of the independent variables are given by
\begin{displaymath}
E\left( \begin{array}{c} x_{i,1} \\ x_{i,2} \end{array} \right) = \left( \begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right)
\mbox{~~ and ~~}
cov\left( \begin{array}{c} x_{i,1} \\ x_{i,2} \end{array} \right) = \left( \begin{array}{rr} \phi_{11} & \phi_{12} \\ \phi_{12} & \phi_{22} \end{array} \right).
\end{displaymath}
Unfortunately $x_{i,2}$, which has an impact on $y_i$ and is correlated with $x_{i,1}$, is not part of the data set. Since $x_{i,2}$ is not observed, it is absorbed by the intercept and error term, as follows:
\begin{eqnarray*}
y_i &=& \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i \\
&=& (\beta_0 + \beta_2\mu_2) + \beta_1 x_{i,1} + (\beta_2 x_{i,2} - \beta_2 \mu_2 + \epsilon_i) \\
&=& \beta^*_0 + \beta_1 x_{i,1} + \epsilon^*_i.
\end{eqnarray*}
It was necessary to add and subtract $\beta_2 \mu_2$ in order to obtain $E(\epsilon^*_i)=0$. And of course there could be more than one omitted variable. They would all get swallowed by the intercept and error term, the garbage bins of regression analysis. A reminder of the relevant covariance rules follows the parts below.
\begin{enumerate}
\item What is $Cov(x_{i,1},\epsilon^*_i)$? This is a scalar calculation.
\item Calculate $Cov(x_{i,1},y_i)$; it's easier if you use the starred version of the model. This is another scalar calculation. Is it possible to have non-zero covariance between $x_{i,1}$ and $y_i$ when $\beta_1=0$?
\item Suppose we want to estimate $\beta_1$ using the usual least squares estimator $b_1$ (see formula sheet). As $n \rightarrow \infty$, does $b_1 \rightarrow \beta_1$? You may use the fact that, like sample means, sample variances and covariances converge to the corresponding Greek-letter versions as $n \rightarrow \infty$ (except possibly on a set of probability zero) like ordinary limits, and all the usual rules of limits apply. So for example, defining $\widehat{\sigma}_{xy}$ as $\frac{1}{n-1}\sum_{i=1}^n(x_{i,1}-\overline{x}_1)(y_i-\overline{y})$, we have $\widehat{\sigma}_{xy} \rightarrow Cov(x_{i,1},y_i)$.
\end{enumerate}
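\vspace{2mm}
\noindent \emph{Reminder:} This sketch just restates the covariance rules that do most of the work in parts (a) and (b); the calculations themselves are left to you. Covariance is linear in each argument: for random variables $U$, $V$ and $W$ and constants $a$ and $b$,
\begin{displaymath}
Cov(U, \; aV + bW) = a \, Cov(U,V) + b \, Cov(U,W),
\end{displaymath}
and the covariance of any random variable with a constant is zero.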
% \newpage
\end{enumerate} % End of assignment

% \textbf{To be continued.}

\vspace{5mm}

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline \end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f17}{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f17}}

\end{document}