\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}

\begin{center}
{\Large \textbf{STA 302f16 Assignment Eleven}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
Except for Question~\ref{sales}, these questions are preparation for the quiz in tutorial on Thursday December 1st, and are not to be handed in. Please bring your printout for Question~\ref{sales} to the quiz.% Do not write anything on the printout in advance of the quiz, except possibly your name and student number.

\begin{enumerate}

\item \label{sales} Telephone sales representatives use computer software to help them locate potential customers, answer questions, take credit card information and place orders. Twelve sales representatives were randomly assigned to each of three new software packages the company was thinking of purchasing. The data for each sales representative include the software package (1, 2 or 3), sales last quarter with the old software, and sales this quarter with one of the new software packages. Sales are in number of units sold. The data are in
\href{http://www.utstat.toronto.edu/~brunner/data/legal/sales.data.txt}
{\texttt{sales.data.txt}}. Get the data with
{\footnotesize
\begin{verbatim}
sales = read.table("http://www.utstat.toronto.edu/~brunner/data/legal/sales.data.txt",header=TRUE)
\end{verbatim}
} % End size
The independent and dependent variables are what you would think.
    \begin{enumerate}
    \item Fit a full model in which the slopes and intercepts of the regression lines relating sales last quarter to sales this quarter might depend on the kind of software the sales representatives are using.
    \item Carry out an ordinary $F$-test to determine whether the effect of software type on sales depends on the representative's performance last quarter. Be able to state your conclusion in plain, non-statistical language.
    \item Estimate the slopes of the three regression lines. Make sure these numbers are on your printout. I don't see how you can do this without making a table.
    \item Carry out tests to answer these questions. If they are already on the output of \texttt{summary}, use that.
        \begin{enumerate}
        \item Are the slopes for Software 1 and 2 different?
        \item Are the slopes for Software 1 and 3 different?
        \item Are the slopes for Software 2 and 3 different?
        \end{enumerate}
    Protecting the three tests with a Bonferroni correction at the joint 0.05 significance level, what do you conclude? Plain language is not necessary, but you should say what happened.
    \item \label{diffatmean} The average (sample mean) performance last quarter was 76.56 (please use exactly this number). We are interested in whether the three software packages differ in their effectiveness for sales representatives with average performance last quarter.
        \begin{enumerate}
        \item Estimate expected performance this quarter for sales representatives with average performance last quarter. These three numbers should appear on your printout.
        \item State the null hypothesis in symbols.
        \item Carry out the $F$-test. % p = 0.5488
        \item In plain language, what do you conclude?
        \end{enumerate}
    \end{enumerate}

\item \label{centered} This question explores the practice of centering quantitative independent variables in a regression by subtracting off the mean. Geometrically, this should not alter the configuration of data points in the multi-dimensional scatterplot.
All it does is shift the axes. Thus, the intercept of the least squares plane should change, but the slopes should not.
    \begin{enumerate}
    \item Consider a simple experimental study with an experimental group, a control group and a single quantitative covariate. Independently for $i=1, \ldots, n$ let
        \begin{displaymath}
        y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \epsilon_i,
        \end{displaymath}
    where $x_i$ is the covariate and $d_i$ is an indicator dummy variable for the experimental group. If the covariate is ``centered,'' the model can be written
        \begin{displaymath}
        y_i = \beta_0^* + \beta_1^* (x_i-\overline{x}) + \beta_2^* d_i + \epsilon_i,
        \end{displaymath}
    where $\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i$.
        \begin{enumerate}
        \item Express the $\beta^*$ quantities in terms of the original $\beta$ quantities.
        \item Let's generalize this. For the general linear model in matrix form suppose $\boldsymbol{\beta}^* = \mathbf{A}\boldsymbol{\beta}$, where $\mathbf{A}$ is a square matrix with an inverse. This makes $\boldsymbol{\beta}^*$ a one-to-one function of $\boldsymbol{\beta}$. Of course $X$ is affected as well. Show that $\mathbf{b}^* = \mathbf{A}\mathbf{b}$.
        \item Give the matrix $\mathbf{A}$ for this $p=3$ model.
        \item If the data are centered, what is $E(y|x)$ for the experimental group, and what is $E(y|x)$ for the control group?
        \end{enumerate}
    \item In the following model, there are $k$ quantitative independent variables. The un-centered version is
        \begin{displaymath}
        y_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{k} x_{i,k} + \epsilon_i,
        \end{displaymath}
    and the centered version is
        \begin{displaymath}
        y_i = \beta_0^* + \beta_1^* (x_{i,1}-\overline{x}_1) + \ldots + \beta_{k}^* (x_{i,k}-\overline{x}_{k}) + \epsilon_i,
        \end{displaymath}
    where $\overline{x}_j = \frac{1}{n}\sum_{i=1}^n x_{i,j}$ for $j = 1, \ldots, k$.
        \begin{enumerate}
        \item What is $\beta_0^*$ in terms of the $\beta$ quantities?
        \item What is $\beta_j^*$ in terms of the $\beta$ quantities?
        \item What is $\widehat{\beta}_0$ in terms of the $\widehat{\beta}^*$ quantities?
        \item Using $\sum_{i=1}^n\widehat{y}_i = \sum_{i=1}^ny_i$, show that $\widehat{\beta}_0^* = \overline{y}$.
        \end{enumerate}
    % \newpage
    \item Now consider again the study with an experimental group, a control group and a single covariate. This time the interaction is included.
        \begin{displaymath}
        y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \beta_3 x_id_i + \epsilon_i
        \end{displaymath}
    The centered version is
        \begin{displaymath}
        y_i = \beta_0^* + \beta_1^* (x_i-\overline{x}) + \beta_2^* d_i + \beta_3^* (x_i-\overline{x})d_i + \epsilon_i
        \end{displaymath}
        \begin{enumerate}
        \item Express the $\beta^*$ quantities from the centered model in terms of the $\beta$ quantities from the un-centered model. Is the correspondence one to one?
        \item \label{difatmean} For the un-centered model, what is the difference between $E(y|X=\overline{x})$ for the experimental group and $E(y|X=\overline{x})$ for the control group?
        \item What is the difference between intercepts for the centered model? Compare this to your answer to Question~\ref{difatmean}.
        \end{enumerate}
    \end{enumerate}

\item \label{gls} In the regression model $\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}$, let $\boldsymbol{\epsilon} \sim N_n(\mathbf{0},\sigma^2\Omega)$, with $\Omega$ a \emph{known} symmetric positive definite matrix.
    \begin{enumerate}
    \item Is $\mathbf{b}$ still an unbiased estimator of $\boldsymbol{\beta}$ for this problem?
    \item What is $cov(\mathbf{b})$ for this problem?
    \item Multiply $\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}$ on the left by $\Omega^{-1/2}$, obtaining $\mathbf{y}^* = X^* \boldsymbol{\beta} + \boldsymbol{\epsilon}^*$. What is the distribution of $\boldsymbol{\epsilon}^*$? (Note that the meaning of the ``*'' symbol is different from its meaning in Question~\ref{centered}, except that in both cases it refers to a transformed version.)
    \item Substituting $X^*$ and $\mathbf{y}^*$ into the formulas for $\mathbf{b}$, obtain the generalized least squares estimate $\mathbf{b}_{gls} = (X^\prime\Omega^{-1}X)^{-1} X^\prime \Omega^{-1} \mathbf{y}$ on page 133 of the textbook. If you look in the textbook, I think you will appreciate the notation we are using.
    \item What is the distribution of $\mathbf{b}_{gls}$? Show your work. All you have to do is calculate the expected value and covariance matrix, but \emph{why}? What specific fact on the formula sheet are you using?
    \item Just realize that there's more. You could obtain formulas for the starred versions of $H$, $\widehat{\mathbf{y}}$, $\mathbf{e}$, and the $F$ statistic for the general linear test.
    \end{enumerate}

\item For a very simple aggregated data set, our data are a collection of sample means $\overline{y}_1, \ldots, \overline{y}_n$. Data values in the \emph{unaggregated} data set come from a distribution with common mean $\mu$ and common variance $\sigma^2$. Sample mean $i$ is based on $m_i$ observations, so that (approximately, by the Central Limit Theorem) $\overline{y}_i \sim N(\mu,\frac{\sigma^2}{m_i})$.
    \begin{enumerate}
    \item One could estimate $\mu$ with the arithmetic mean of the sample means. Is this estimator unbiased? What is its variance?
    \item Start with the regression-like equation $\overline{y}_i = \mu + \epsilon_i$, where $\epsilon_i \sim N(0,\frac{\sigma^2}{m_i})$. Multiply both sides by $\sqrt{m_i}$, obtaining a starred version of the regression equation. Give the generalized (weighted) least squares estimate of $\mu$.
    \item If you had access to the unaggregated data, how would you estimate $\mu$? What is the connection of this statistic to the weighted least squares estimate?
    \end{enumerate}

\end{enumerate} % End of assignment

\noindent
Please bring your printout for Question~\ref{sales} to the quiz. \textbf{Your printout should show \emph{all} R input and output, and \emph{only} R input and output}.
Do not write anything on your printouts except your name and student number.

\vspace{5mm}

\noindent
\begin{center}\begin{tabular}{l}
\hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by
\href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner},
Department of Statistical Sciences, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}.
Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f16}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f16}}

\end{document}