\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442f12 Assignment Ten}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
Please bring your log and list files for Question~\ref{bweight} to the quiz. The non-computer parts are just practice for the quiz, and are not to be handed in. Any necessary formulas will be provided.
\begin{enumerate}
\item This question explores the practice of ``centering'' quantitative explanatory variables in a regression by subtracting off the mean.
    \begin{enumerate}
    \item Consider a simple experimental study with an experimental group, a control group and a single quantitative covariate. Independently for $i=1, \ldots, n$ let
        \begin{displaymath}
        Y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \epsilon_i,
        \end{displaymath}
        where $x_i$ is the covariate and $d_i$ is an indicator dummy variable for the experimental group. If the covariate is ``centered,'' the model can be written
        \begin{displaymath}
        Y_i = \beta_0^\prime + \beta_1^\prime (x_i-\overline{x}) + \beta_2^\prime d_i + \epsilon_i,
        \end{displaymath}
        where $\overline{x} = \frac{1}{n}\sum_{i=1}^n x_i$.
        \begin{enumerate}
        \item Express the $\beta^\prime$ quantities in terms of the $\beta$ quantities.
        \item If the data are centered, what is $E(Y|x)$ for the experimental group compared to $E(Y|x)$ for the control group?
        \item By the invariance principle, what is $\widehat{\beta}_0$ in terms of $\widehat{\beta}^\prime$ quantities? Assume $\epsilon_i$ normal if you wish.
        \end{enumerate}
    \item In this model, there are $p-1$ quantitative explanatory variables. The un-centered version is
        \begin{displaymath}
        Y_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{p-1} x_{i,p-1} + \epsilon_i,
        \end{displaymath}
        and the centered version is
        \begin{displaymath}
        Y_i = \beta_0^\prime + \beta_1^\prime (x_{i,1}-\overline{x}_1) + \ldots + \beta_{p-1}^\prime (x_{i,p-1}-\overline{x}_{p-1}) + \epsilon_i,
        \end{displaymath}
        where $\overline{x}_j = \frac{1}{n}\sum_{i=1}^n x_{i,j}$ for $j = 1, \ldots, p-1$.
        \begin{enumerate}
        \item What is $\beta_0^\prime$ in terms of the $\beta$ quantities?
        \item What is $\beta_j^\prime$ in terms of the $\beta$ quantities?
        \item By the invariance principle, what is $\widehat{\beta}_0$ in terms of the $\widehat{\beta}^\prime$ quantities? Assume $\epsilon_i$ normal if you wish.
        \item Show that $\widehat{\beta}_0^\prime = \overline{Y}$. Hint: Differentiate the log likelihood. % with respect to $\beta_0^\prime$.
        \end{enumerate}
    \item Now consider again the study with an experimental group, a control group and a single covariate. This time the interaction is included.
        \begin{displaymath}
        Y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \beta_3 x_id_i + \epsilon_i
        \end{displaymath}
        The centered version is
        \begin{displaymath}
        Y_i = \beta_0^\prime + \beta_1^\prime (x_i-\overline{x}) + \beta_2^\prime d_i + \beta_3^\prime (x_i-\overline{x})d_i + \epsilon_i
        \end{displaymath}
        \begin{enumerate}
        \item For the un-centered model, what is the difference between $E(Y|X=\overline{x})$ for the experimental group and $E(Y|X=\overline{x})$ for the control group?
        \item What is the difference between intercepts for the centered model?
        \end{enumerate}
    \item Suppose that in the study with an experimental group, a control group and a single covariate, the response variable is binary and we are doing a logistic regression.
        \begin{enumerate}
        \item Under the un-centered model, if there is no interaction, the odds of $Y=1$ are \underline{~~~~~} times as great for the experimental group, for any fixed value of $x$.
        \item Under the \emph{centered} model, if there is no interaction, the odds of $Y=1$ are \underline{~~~~~} times as great for the experimental group, for any fixed value of $x$.
        \item If there \emph{is} an interaction and $x=\overline{x}$, the odds of $Y=1$ for the experimental group are \underline{~~~~~} times as great. Express the answer in terms of $\beta$ values, and also in terms of $\beta^\prime$ values.
        \end{enumerate}
    \end{enumerate}
\item This question will be a lot easier if you remember that if $X \sim \chi^2(\nu)$, then $E(X)=\nu$ and $Var(X)=2\nu$. You don't have to prove this; just use it. You can also use things you already know about ordinary linear regression with normal errors. For the usual linear regression model with normal errors, $\sigma^2$ is usually estimated with $MSE$.
    \begin{enumerate}
    \item Show that $MSE$ is an unbiased estimator of $\sigma^2$.
    \item Show that $MSE$ is a consistent estimator of $\sigma^2$.
    \item Under the usual regression model what is the joint distribution of $\epsilon_1, \ldots, \epsilon_n$?
    \item Let $T_n = \frac{1}{n} \sum_{i=1}^n \epsilon_i^2$. What is $E(T_n)$?
    \item How do you know that $T_n \stackrel{p}{\rightarrow} \sigma^2$?
    \item Show that $Var(T_n) < Var(MSE)$.
    \item So it would appear that $T_n$ is a better estimator of $\sigma^2$ than $MSE$ is, since they are both unbiased and the variance of $T_n$ is lower. So why do you think $MSE$ is used in regression analysis instead of $T_n$?
    \end{enumerate}
\newpage
\item \label{bweight} In Question 9 of Assignment 7, you analyzed the Birth Weight data with R. This time we will use SAS. There is a link to the data from our \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/}{Data Sets} page.
    In Assignment 7 the mother's age did not do much, so the variables we will use this time are
    \begin{itemize}
    \item Mother's weight in pounds at her last period (\texttt{lwt})
    \item Mother's race (\texttt{race}: 1 = white, 2 = black, 3 = other)
    \item Baby's birth weight in grams (\texttt{bwt})
    \end{itemize}
    \begin{enumerate}
    \item First, fit a model with parallel regression lines for the three racial groups. For all the hypothesis tests, be able to give the value of the test statistic, the $p$-value, whether you reject $H_0$ at $\alpha=0.05$, and state the conclusion in plain, non-statistical language.
        \begin{enumerate}
        \item What proportion of the variation in baby's weight is explained by the mother's weight and race together?
        \item Controlling for mother's weight, is mother's race related to baby's weight?
        \item If the answer to the last question is Yes, carry out Bonferroni-corrected pairwise comparisons and draw a plain language conclusion.
        \item Controlling for mother's race, is mother's weight related to baby's weight? If the answer is Yes, be able to say \emph{how} it's related.
        \item For every one pound increase in the mother's weight, the baby's weight (increases, decreases) by \underline{\hspace{15mm}} grams. % Need a separate proc reg for this unless they read the manual. I'll get it when I generate the residuals.
        \end{enumerate}
    \item \label{interact} Now test whether race differences in baby's birth weight \emph{depend} on the mother's weight. In plain language, what do you conclude?
    \item Before proceeding with the data analysis, let's do a little thinking about Studentized deleted residuals. As discussed in lecture, the Studentized deleted residuals have a $t$ distribution under the assumption that the observation in question comes from the same population as the other $n-1$. Thus, each Studentized deleted residual may be treated as the test statistic of a $t$-test. Get any requested numbers with \texttt{proc iml}.
        \begin{enumerate}
        \item For this data set, what is the critical value at $\alpha=0.05$? Please don't do any adjustments for multiple tests, yet. % 1.9729405
        \item How many (absolute valued) Studentized deleted residuals would you expect to fall beyond this critical value just by chance if the model is correct for all $n$ observations? The answer is a number (not an integer). Look at the next question to see the reasoning.
        \item Write the number of Studentized deleted residuals beyond the critical value as a sum of random variables, then take the expected value. This shows that the non-independence of these random variables has no effect on the \emph{expected} number beyond the critical value. % 189*0.05 = 9.45
        \item And indeed the random variables are not independent. There is one for each hypothesis test, and the test statistics have almost the same $\widehat{\beta}$ and $MSE$. How many test statistics are there? The answer is a number. % n=189
        \item Suppose we want to protect all the tests against Type I error at \emph{joint} significance level 0.05 with a Bonferroni correction. What critical values of $t$ should we use? The answer is a number -- well, a pair of numbers. % 3.7199023
        \item If the model is correct, the probability of getting \emph{any} Studentized deleted residuals beyond the Bonferroni critical value can be no more than \underline{\hspace{15mm}}. That's better! It's helpful to think of detecting outliers as a multiple comparison problem.
        \end{enumerate}
    \item Based on the results of Question~\ref{interact}, you will choose either a model with interactions or without interactions. For that model, generate the Studentized deleted residuals.
        \begin{enumerate}
        \item List all the Studentized deleted residuals that are beyond the Bonferroni critical values. Is this cause for serious concern?
        \item How about approximate normality of the residuals? Base your assessment on tests using the usual $\alpha=0.05$ significance level.
        \end{enumerate}
    \end{enumerate}
\end{enumerate}
\vspace{70mm}
%\newpage
\noindent
\begin{center}\begin{tabular}{l} \hspace{6.5in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf12}
{\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf12}}
\end{document}
Initially, I had a quote from Tennyson buried in the data. Lots of Invalid data and