\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
\topmargin=-.3in \textheight=9.4in
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442 Assignment Ten}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}
\vspace{3mm}
\noindent Please bring printouts of your complete SAS log and list files for Question~\ref{bweight} to the quiz; PDF output counts as a list file. Note that the log and list files \emph{must be from the same run of SAS}. The non-computer questions are just practice for the quiz, and are not to be handed in.
\begin{enumerate}
\item \label{workout} In a study comparing the effectiveness of different exercise programmes, volunteers were randomly assigned to one of three exercise programmes ($A$, $B$, $C$) or put on a waiting list and told to work out on their own. Aerobic capacity is the body's ability to process oxygen. Aerobic capacity was measured before and after 6 months of participation in the programme (or 6 months of being on the waiting list). The response variable was improvement in aerobic capacity. The explanatory variables were age (a covariate) and treatment group.
\begin{enumerate}
\item First consider a regression model with an intercept, and no interaction between age and treatment group.
\begin{enumerate}
\item Make a table showing how you would set up indicator dummy variables for treatment group. Make Waiting List the reference category.
\item Write the regression model. Please use $x$ for age, and make its regression coefficient $\beta_1$.
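As a quick numerical illustration of reference-category coding (in Python rather than SAS, with made-up observations; this is a sketch of the general scheme, not the required table), the indicator dummy variables work like this:

```python
# Sketch: indicator (reference-category) dummy coding for a four-level
# treatment factor, with Waiting List as the reference category.
# Hypothetical labels; the assignment itself uses SAS, but the coding
# scheme is the same in any software.
def dummy_code(group):
    # d1, d2, d3 flag membership in programmes A, B, C;
    # the Waiting List reference category is coded (0, 0, 0).
    return (int(group == "A"), int(group == "B"), int(group == "C"))

for g in ["A", "B", "C", "WaitingList"]:
    print(g, dummy_code(g))
```

With this coding, the intercept belongs to the reference category and each dummy coefficient is a difference from it.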
\item In terms of $\beta$ values, what null hypothesis would you test to find out whether, allowing for age, the three exercise programmes differ in their effectiveness?
\item Write the null hypothesis for the preceding question as $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$. Just give the $\mathbf{L}$ matrix.
\item In terms of $\beta$ values, what null hypothesis would you test to find out whether Programme $B$ was better than the Waiting List?
\item In terms of $\beta$ values, what null hypothesis would you test to find out whether Programmes $A$ and $B$ differ in their effectiveness?
\item Suppose you wanted to estimate the difference in average benefit between programmes $A$ and $C$ for a 27-year-old participant. Give your answer in terms of $\widehat{\beta}$ values.
\item Is it safe to assume that age is independent of the other explanatory variables? Answer Yes or No and briefly explain.
\end{enumerate}
% \newpage
\item Now consider a regression model with an intercept and the interaction (actually a set of interactions) between age and treatment.
\begin{enumerate}
\item Write the regression model. Make it an extension of your earlier model.
\item Suppose you wanted to know whether the slopes of the 4 regression lines were equal. In terms of $\beta$ values, what null hypothesis would you test?
\item Suppose you wanted to know whether any differences among mean improvement in the four treatment conditions depend on the participant's age. In terms of $\beta$ values, what null hypothesis would you test?
\item Write the null hypothesis for the preceding question as $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$. Just give the $\mathbf{L}$ matrix. It is $r \times p$. What is $r$? What is $p$?
\item Suppose you wanted to know whether the difference in effectiveness between Programme $A$ and the Waiting List depends on the participant's age. In terms of $\beta$ values, what null hypothesis would you test?
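For reference in the $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$ questions throughout this assignment, recall the standard general linear test statistic (stated here without derivation): for an $r \times p$ matrix $\mathbf{L}$ of full row rank,
\begin{displaymath}
F = \frac{(\mathbf{L}\widehat{\boldsymbol{\beta}})^\prime
\left( \mathbf{L}(\mathbf{X}^\prime\mathbf{X})^{-1}\mathbf{L}^\prime \right)^{-1}
(\mathbf{L}\widehat{\boldsymbol{\beta}})}{r \, MSE},
\end{displaymath}
which under $H_0$ has an $F$ distribution with $r$ and $n-p$ degrees of freedom. The questions here ask only for $H_0$ and $\mathbf{L}$, not for the value of $F$.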
\item Suppose you wanted to \emph{estimate} the difference in average benefit between programmes $A$ and $C$ for a 27-year-old participant. Give your answer in terms of $\widehat{\beta}$ values.
\end{enumerate}
\item Now consider a regression model \emph{without} an intercept, but \emph{with} possibly unequal slopes. Make a table to show how the dummy variables could be set up, and write the regression model. Again, please use $x$ for age and make its regression coefficient $\beta_1$. For each treatment condition, what is the conditional expected value of $Y$? The answer is in terms of $x$ and the $\beta$ values. Please put these values in the last column of your table.
\begin{enumerate}
\item Suppose you wanted to know whether the slopes of the 4 regression lines were equal. In terms of $\beta$ values, what null hypothesis would you test?
\item Suppose you wanted to know whether any differences among mean improvement in the four treatment conditions depend on the participant's age. In terms of $\beta$ values, what null hypothesis would you test?
\item Write the null hypothesis for the preceding question as $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$. Just give the $\mathbf{L}$ matrix. It is $r \times p$. What is $r$? What is $p$?
\item Suppose you wanted to know whether the difference in effectiveness between Programme $A$ and the Waiting List depends on the participant's age. In terms of $\beta$ values, what null hypothesis would you test?
\item Suppose you wanted to estimate the difference in average benefit between programmes $A$ and $C$ for a 27-year-old participant. Give your answer in terms of $\widehat{\beta}$ values.
\end{enumerate}
\end{enumerate}
\item This question explores the practice of ``centering'' quantitative explanatory variables in a regression by subtracting off the mean.
\begin{enumerate}
\item Consider a simple experimental study with an experimental group, a control group and a single quantitative covariate.
Independently for $i=1, \ldots, n$, let
\begin{displaymath}
Y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \epsilon_i,
\end{displaymath}
where $x_i$ is the covariate and $d_i$ is an indicator dummy variable for the experimental group. If the covariate is ``centered,'' the model can be written
\begin{displaymath}
Y_i = \beta_0^\prime + \beta_1^\prime (x_i-\overline{x}) + \beta_2^\prime d_i + \epsilon_i,
\end{displaymath}
where $\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i$.
\begin{enumerate}
\item Express the $\beta^\prime$ quantities in terms of the $\beta$ quantities.
\item If the data are centered, how does $E(Y|x)$ for the experimental group compare to $E(Y|x)$ for the control group?
\item By the invariance principle (this takes you back all the way to slide 25 of Likelihood Part One), what is $\widehat{\beta}_0$ in terms of $\widehat{\beta}^\prime$ quantities? Assume $\epsilon_i$ is normal.
\end{enumerate}
\item In this model, there are $p-1$ quantitative explanatory variables. The un-centered version is
\begin{displaymath}
Y_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{p-1} x_{i,p-1} + \epsilon_i,
\end{displaymath}
and the centered version is
\begin{displaymath}
Y_i = \beta_0^\prime + \beta_1^\prime (x_{i,1}-\overline{x}_1) + \ldots + \beta_{p-1}^\prime (x_{i,p-1}-\overline{x}_{p-1}) + \epsilon_i,
\end{displaymath}
where $\overline{x}_j = \frac{1}{n}\sum_{i=1}^n x_{i,j}$ for $j = 1, \ldots, p-1$.
\begin{enumerate}
\item What is $\beta_0^\prime$ in terms of the $\beta$ quantities?
\item What is $\beta_j^\prime$ in terms of the $\beta$ quantities?
\item By the invariance principle, what is $\widehat{\beta}_0$ in terms of the $\widehat{\beta}^\prime$ quantities? Assume $\epsilon_i$ is normal.
\item Using $\sum_{i=1}^n\widehat{Y}_i = \sum_{i=1}^nY_i$, show that $\widehat{\beta}_0^\prime = \overline{Y}$.
\end{enumerate}
\item Now consider again the study with an experimental group, a control group and a single covariate. This time the interaction is included.
\begin{displaymath}
Y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \beta_3 x_id_i + \epsilon_i
\end{displaymath}
The centered version is
\begin{displaymath}
Y_i = \beta_0^\prime + \beta_1^\prime (x_i-\overline{x}) + \beta_2^\prime d_i + \beta_3^\prime (x_i-\overline{x})d_i + \epsilon_i
\end{displaymath}
\begin{enumerate}
\item For the un-centered model, what is the difference between $E(Y|X=\overline{x})$ for the experimental group and $E(Y|X=\overline{x})$ for the control group?
\item What is the difference between intercepts for the centered model?
\end{enumerate}
\end{enumerate}
% \item
\pagebreak
\item \label{bweight} The \href{http://www.utstat.toronto.edu/~brunner/appliedf13/code_n_data/hw/bweight.data}{Birth weight data} set contains the following information on a sample of mothers who recently had babies.
\begin{itemize}
\item[] Identification code
\item[] Indicator of birth weight less than 2.5 kg
\item[] Mother's age in years
\item[] Mother's weight in pounds at last menstrual period
\item[] Mother's race (1 = white, 2 = black, 3 = other)
\item[] Smoking status during pregnancy
\item[] Number of previous premature labours
\item[] History of hypertension
\item[] Presence of uterine irritability
\item[] Number of physician visits during the first trimester
\item[] Birth weight of baby in grams
\end{itemize}
For this question, we will use just Mother's weight, Mother's race and Baby's birth weight.
\begin{enumerate}
\item First, fit a model with parallel regression lines for the three racial groups. For all the hypothesis tests, be able to give the value of the test statistic, the $p$-value, whether you reject $H_0$ at $\alpha=0.05$, and state the conclusion in plain, non-statistical language.
\begin{enumerate}
\item What proportion of the variation in baby's weight is explained by the mother's weight and race together?
\item Controlling for mother's weight, is mother's race related to baby's weight?
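The parallel-lines model being fit here can be sketched numerically. The following simulation (in Python/numpy with entirely made-up coefficients and data, not the real birth weight file; the assignment itself is to be done in SAS) fits $E(Y) = \beta_0 + \beta_1 \mbox{weight} + \beta_2 d_2 + \beta_3 d_3$ by least squares, with race 1 as the reference category:

```python
# Sketch (simulated data): parallel regression lines for three groups,
# fit by ordinary least squares with two indicator dummies.
import numpy as np

rng = np.random.default_rng(0)
n = 300
weight = rng.uniform(100, 200, n)        # covariate, like mother's weight
race = rng.integers(1, 4, n)             # group labels 1, 2, 3
d2 = (race == 2).astype(float)           # indicator for group 2
d3 = (race == 3).astype(float)           # indicator for group 3 (group 1 is reference)

# Hypothetical "true" coefficients, chosen only for this simulation
y = 2500.0 + 4.0 * weight - 200.0 * d2 + 100.0 * d3 + rng.normal(0, 300, n)

X = np.column_stack([np.ones(n), weight, d2, d3])
betahat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Proportion of variation explained (R-squared)
fitted = X @ betahat
r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
print("betahat:", betahat.round(2), " R-squared:", round(r2, 3))
```

Because the lines are parallel, the dummy coefficients are the (constant) vertical distances between the group lines, which is what the race test examines.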
\item If the answer to the last question is Yes, carry out Bonferroni-corrected pairwise comparisons and draw a plain language conclusion.
\item Controlling for mother's race, is mother's weight related to baby's weight? If the answer is Yes, be able to say \emph{how} it's related.
\item For every one-pound increase in the mother's weight, the baby's estimated weight (increases, decreases) by \underline{\hspace{15mm}} grams.
% Need a separate proc reg for this unless they read the manual. I'll get it when I generate the residuals. HUH? (2013)
\end{enumerate}
\item \label{interact} Now test whether race differences in baby's birth weight \emph{depend} on the mother's weight. In plain language, what do you conclude?
\end{enumerate}
\pagebreak
\item In the following regression model, the explanatory variables $X_1$ and $X_2$ are random variables. The true model is
\begin{displaymath}
Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \epsilon_i,
\end{displaymath}
independently for $i= 1, \ldots, n$, where $\epsilon_i \sim N(0,\sigma^2)$. The mean and covariance matrix of the explanatory variables are given by
\begin{displaymath}
E\left[ \begin{array}{c} X_{i,1} \\ X_{i,2} \end{array} \right]
= \left[ \begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right]
\mbox{~~ and ~~}
Var\left[ \begin{array}{c} X_{i,1} \\ X_{i,2} \end{array} \right]
= \left[ \begin{array}{rr} \phi_{11} & \phi_{12} \\ \phi_{12} & \phi_{22} \end{array} \right]
\end{displaymath}
Unfortunately $X_{i,2}$, which has an impact on $Y_i$ and is correlated with $X_{i,1}$, is not part of the data set. Since $X_{i,2}$ is not observed, it is absorbed by the intercept and error term, as follows.
\begin{eqnarray*}
Y_i &=& \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \epsilon_i \\
&=& (\beta_0 + \beta_2\mu_2) + \beta_1 X_{i,1} + (\beta_2 X_{i,2} - \beta_2 \mu_2 + \epsilon_i) \\
&=& \beta^\prime_0 + \beta_1 X_{i,1} + \epsilon^\prime_i.
\end{eqnarray*}
The primes just denote a new $\beta_0$ and a new $\epsilon_i$.
It was necessary to add and subtract $\beta_2 \mu_2$ in order to obtain $E(\epsilon^\prime_i)=0$. And of course there could be more than one omitted variable. They would all get swallowed by the intercept and error term, the garbage bins of regression analysis.
\begin{enumerate}
\item What is $Cov(X_{i,1},\epsilon^\prime_i)$?
\item Calculate the variance-covariance matrix of $(X_{i,1},Y_i)$ under the true model. Is it possible to have non-zero covariance between $X_{i,1}$ and $Y_i$ when $\beta_1=0$?
\item Suppose we want to estimate $\beta_1$. The usual least squares estimator is
\begin{displaymath}
\widehat{\beta}_1 = \frac{\sum_{i=1}^n(X_{i,1}-\overline{X}_1)(Y_i-\overline{Y})}
{\sum_{i=1}^n(X_{i,1}-\overline{X}_1)^2}.
\end{displaymath}
You may just use this formula; you don't have to derive it. Is $\widehat{\beta}_1$ a consistent estimator of $\beta_1$ if the true model holds? Answer Yes or No and show your work. You may use the consistency of the sample variance and covariance without proof.
\item Are there \emph{any} points in the parameter space for which $\widehat{\beta}_1 \stackrel{p}{\rightarrow} \beta_1$ when the true model holds?
% \item Ordinary least squares is often applied to data sets where the independent variables are best modeled as random variables. In what way does the usual linear regression model imply that (random) independent variables and error terms have zero covariance?
% In 2013, did this in an earlier assignment.
\end{enumerate}
\pagebreak
%%%%%%%%%%%%%%%%%%%%%%%% mereg
\item\label{mereg} Consider simple regression through the origin in which the explanatory variable values are random variables rather than fixed constants. But you can't see the explanatory variable. It is a \emph{latent} variable. Instead, all you see is the explanatory variable plus a piece of random noise.
Independently for $i=1, \ldots, n$, let
\begin{eqnarray} \label{witherror}
Y_i & = & X_i \beta + \epsilon_i \\
W_i & = & X_i + e_i, \nonumber
\end{eqnarray}
where
\begin{itemize}
\item $X_i$ has expected value $\mu_x$ and variance $\sigma^2_x$,
\item $e_i$ has expected value $0$ and variance $\sigma^2_e$,
\item $\epsilon_i$ has expected value $0$ and variance $\sigma^2_\epsilon$, and
\item $X_i$, $\epsilon_i$ and $e_i$ are all independent.
\end{itemize}
The value of the explanatory variable $X_i$, like $\epsilon_i$ and $e_i$, is not observable. All we can see are the pairs $(W_i,Y_i)$ for $i=1, \ldots, n$.
\begin{enumerate}
\item Following common practice, we ignore the measurement error and apply the usual regression estimator with $W_i$ in place of $X_i$. The parameter $\beta$ is estimated by
\begin{displaymath}
\widehat{\beta}_{(1)} = \frac{\sum_{i=1}^n W_iY_i}{\sum_{i=1}^n W_i^2}
\end{displaymath}
Is $\widehat{\beta}_{(1)}$ a consistent estimator of $\beta$? Answer Yes, No or Impossible to determine. Show your work.
\item Consider instead the estimator
\begin{displaymath}
\widehat{\beta}_{(2)} = \frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n W_i}.
\end{displaymath}
Is $\widehat{\beta}_{(2)}$ a consistent estimator of $\beta$? Answer Yes, No or Impossible to determine. Show your work. Does the value of $\mu_x$ matter?
% Provided everything is normal and mu is not zero, you can get the MLE explicitly without differentiating anything using the invariance principle. This could be pretty cool. Maybe you don't need mu ne zero to get the MLE, but the MLE itself goes to hell for large samples if mu=0.
\item Suppose $X_i$, $\epsilon_i$ and $e_i$ are normally distributed. What is the joint distribution of $(W_i,Y_i)$? Calculate the vector of expected values and the covariance matrix.
\item Using the invariance principle, obtain explicit formulas for the MLE of $\boldsymbol{\theta} = (\beta, \mu_x, \sigma^2_x, \sigma^2_e, \sigma^2_\epsilon)^\prime$ without differentiating anything.
You may use without proof the fact that the MLE of a general multivariate normal is $(\overline{\mathbf{D}},\boldsymbol{\widehat{\Sigma}})$, where
\begin{displaymath}
\boldsymbol{\widehat{\Sigma}} = \frac{1}{n}\sum_{i=1}^n (\mathbf{D}_i-\overline{\mathbf{D}}) (\mathbf{D}_i-\overline{\mathbf{D}})^\prime .
\end{displaymath}
Use symbols like $\widehat{\sigma}_{xw}$ for the sample variances and covariances.
\end{enumerate} % Ending parts of mereg question
\end{enumerate}
% \vspace{80mm}
\noindent
\begin{center}\begin{tabular}{l} \hspace{6.5in} \\ \hline \end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf13} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf13}}
\end{document}