\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442 Assignment 4}}\footnote{This assignment was prepared by
\href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto.
It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}.
Use any part of it as you like and share the result freely.
The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf18}
{\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf18}}}
\vspace{1 mm}
\end{center}

\noindent These questions are practice for the midterm and final exam, and are not to be handed in.

\begin{enumerate}

%%%%%%%%%%%% %%%%%%%%%%
\item It is well known that people who graduate from university have higher lifetime earnings on average than those who do not. Mention at least one confounding variable that could have produced this result.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Normal Regression %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item Let $\mathbf{y}=\mathbf{X} \boldsymbol{\beta}+\boldsymbol{\epsilon}$, where $\mathbf{X}$ is an $n \times p$ matrix of known constants, $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown constants, and $\boldsymbol{\epsilon}$ is multivariate normal with mean zero and covariance matrix $\sigma^2 \mathbf{I}_n$. The constant $\sigma^2 > 0$ is unknown.
% In the following, it may be helpful to recall that $(\mathbf{A}^{-1})^\top=(\mathbf{A}^\top)^{-1}$.
\begin{enumerate}
\item Show $\mathbf{X}^\top \mathbf{e} = \mathbf{0}$, where $\mathbf{e} = \mathbf{y} - \mathbf{X}\widehat{\boldsymbol{\beta}}$ is the vector of residuals.
\item Why does $\mathbf{X}^\top\mathbf{e}=\mathbf{0}$ tell you that if a regression model has an intercept, the residuals must add up to zero?
\item Consider a regression model with an intercept, so that the sum of residuals is equal to zero. Prove the following decomposition of sums of squares, also given on the formula sheet: $SST=SSE+SSR$. Hint: Starting with scalar calculations, add and subtract $\widehat{y}_i$. Switch to matrix notation partway through the calculation.
\end{enumerate}

\item High School History classes from across Ontario are randomly assigned to either a discovery-oriented or a memory-oriented curriculum in Canadian history. At the end of the year, the students are given a standardized test and the median score of each class is recorded. Please consider a regression model with these variables:
\begin{itemize}
\item[$X_1$] Equals 1 if the class uses the discovery-oriented curriculum, and equals 0 if the class uses the memory-oriented curriculum.
\item[$X_2$] Average parents' education for the classroom.
\item[$X_3$] Average family income for the classroom.
\item[$X_4$] Number of university History courses taken by the teacher.
\item[$X_5$] Teacher's final cumulative university grade point average.
\item[$Y$] Class median score on the standardized history test.
\end{itemize}
The full regression model (as opposed to the reduced models for various null hypotheses) implies
\begin{displaymath}
E[Y|X] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + \beta_5X_5.
\end{displaymath}
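% Re the SST = SSE + SSR decomposition in the matrix-regression question above, one possible
% outline (not the only route):
% SST = sum (y_i - ybar)^2 = sum (y_i - yhat_i + yhat_i - ybar)^2
%     = SSE + SSR + 2 * sum e_i (yhat_i - ybar),
% and the cross term equals 2(e'X betahat - ybar * sum e_i) = 0 by the two previous parts.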
For each question below, please give
\begin{itemize}
\item The null hypothesis in terms of $\beta$ values.
\item $E[Y|X]$ for the reduced model you would use to answer the question. Don't re-number the variables.
\end{itemize}
\vspace{2mm}
\begin{enumerate}
\item If you allow for parents' education and income and for teacher's university background, does curriculum type affect test scores? (And why is it okay to use the word ``affect''?)
\item Controlling for parents' education and income and for curriculum type, is teacher's university background (two variables) related to their students' test performance?
\item Correcting for teacher's university background and for curriculum type, are parents' education and family income (considered simultaneously) related to students' test performance?
\item Taking curriculum type, teacher's university background and parents' education into consideration, is parents' income related to students' test performance?
\item Here is one final question. Assuming that $X_1, \ldots, X_5$ are random variables (and I hope you agree that they are),
\begin{enumerate}
\item Would you expect $X_1$ to be related to the other explanatory variables?
\item Would you expect the other explanatory variables to be related to each other?
\end{enumerate}
\end{enumerate}

\item \label{sat} In the United States, admission to university is based partly on high school marks and recommendations, and partly on applicants' performance on a standardized multiple choice test called the Scholastic Aptitude Test (SAT). The SAT has two sub-tests, Verbal and Math. A university administrator selected a random sample of 200 applicants, and recorded the Verbal SAT, the Math SAT and first-year university Grade Point Average (GPA) for each student. The data are available
\href{http://www.utstat.toronto.edu/~brunner/data/legal/openSAT.data.txt}{here}.
We seek to predict GPA from the two test scores. Throughout, please use the usual $\alpha=0.05$ significance level.
\begin{enumerate}
\item First, fit a model using just the Math score as a predictor. ``Fit'' means estimate the model parameters. Does there appear to be a relationship between Math score and grade point average?
\begin{enumerate}
\item Answer Yes or No.
\item Fill in the blank. Students who did better on the Math test tended to have \underline{~~~~~~~~~~~} first-year grade point average.
\item Do you reject $H_0: \beta_1=0$?
\item Are the results statistically significant? Answer Yes or No.
\item What is the $p$-value? The answer can be found in \emph{two} places on your printout.
\item What proportion of the variation in first-year grade point average is explained by score on the SAT Math test? The answer is a number from your printout.
\item Give a predicted first-year grade point average for a student who got 700 on the Math SAT. The answer is a number you could get with a calculator from your printout.
\end{enumerate}
\item Now fit a model with both the Math and Verbal sub-tests.
\begin{enumerate}
\item Give the test statistic, the degrees of freedom and the $p$-value for each of the following null hypotheses. The answers are numbers from your printout.
\begin{enumerate}
\item $H_0: \beta_1=\beta_2=0$
\item $H_0: \beta_1=0$
\item $H_0: \beta_2=0$
\item $H_0: \beta_0=0$
\end{enumerate}
\item Controlling for Math score, is Verbal score related to first-year grade point average?
\begin{enumerate}
\item Give the value of the test statistic. The answer is a number from your printout.
\item Give the $p$-value.
The answer is a number from your printout.
\item Do you reject the null hypothesis?
\item Are the results statistically significant? Answer Yes or No.
\item In plain, non-statistical language, what do you conclude? The answer is something about test scores and grade point average.
\end{enumerate}
\item Allowing for Verbal score, is Math score related to first-year grade point average?
\begin{enumerate}
\item Give the value of the test statistic. The answer is a number from your printout.
\item Give the $p$-value. The answer is a number from your printout.
\item Do you reject the null hypothesis?
\item Are the results statistically significant? Answer Yes or No.
\item In plain, non-statistical language, what do you conclude? The answer is something about test scores and grade point average.
\end{enumerate}
\item Give a predicted first-year grade point average for a student who got 650 on the Verbal and 700 on the Math SAT.
\item Let's do one more test. We want to know whether expected GPA increases faster as a function of the Verbal SAT, or the Math SAT. That is, we want to compare the regression coefficients, testing $H_0: \beta_1=\beta_2$.
\begin{enumerate}
\item Express the null hypothesis in matrix form as $\mathbf{L}\boldsymbol{\beta} = \mathbf{h}$.
\item Carry out an $F$ test\footnote{If you do not remember how to do this with R from your regression course, several packages provide functions to do it --- for example, the \texttt{linearHypothesis} function in the \texttt{car} (Companion to Applied Regression) package. Or you could just search online. When I did this, I found a useful
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f14/lectures/302f14GeneralLinearTestWithR.pdf}
{example from a regression course I taught several years ago}. Feel free to use my \texttt{ftest} function if you wish.}.
\item State your conclusion in plain, non-technical language. It's something about first-year grade point average.
% Can't conclude that expected GPA increases at different rates as a function of Verbal SAT and Math SAT.
\end{enumerate}
\end{enumerate}
\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Omitted variables %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item Ordinary linear regression is often applied to data sets where the independent variables are best modeled as random variables: write $y_i = \mathbf{X}_i^\top\boldsymbol{\beta} + \epsilon_i$. In what way does the usual conditional linear regression model with normal errors imply that random explanatory variables have zero covariance with the error term? Hint: Assume that $\mathbf{X}_i$, like $\epsilon_i$, is continuous. What is the conditional distribution of $\epsilon_i$ given $\mathbf{X}_i$?
\item For a model with just one (random) explanatory variable, show that $E(\epsilon_i|X_i=x_i)=0$ for all $x_i$ implies $Cov(X_i,\epsilon_i)=0$, so that a standard regression model without the normality assumption still implies zero covariance, though not necessarily independence, between the error term and explanatory variables.
\item In the following regression model, the explanatory variables $X_1$ and $X_2$ are random variables. The true model is
\begin{displaymath}
Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \epsilon_i,
\end{displaymath}
independently for $i= 1, \ldots, n$, where $\epsilon_i \sim N(0,\sigma^2)$.
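% A minimal R sketch for the SAT problem, in case it helps. The column names GPA, MATH and
% VERBAL are a guess -- check names(sat) after reading the data.
% sat = read.table("http://www.utstat.toronto.edu/~brunner/data/legal/openSAT.data.txt", header=TRUE)
% mod1 = lm(GPA ~ MATH, data=sat); summary(mod1)          # Math score only
% predict(mod1, newdata=data.frame(MATH=700))             # predicted GPA at Math = 700
% mod2 = lm(GPA ~ MATH + VERBAL, data=sat); summary(mod2) # both sub-tests
% predict(mod2, newdata=data.frame(MATH=700, VERBAL=650))
% # For H0: beta1 = beta2, one option is car::linearHypothesis (may need install.packages("car"))
% # with L = (0, 1, -1), matching the coefficient order (Intercept, MATH, VERBAL) in mod2:
% library(car); linearHypothesis(mod2, c(0, 1, -1), rhs=0)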
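% Re the two zero-covariance questions above, a sketch of the key step (one way to do it):
% if E(epsilon_i | X_i) = 0, double expectation gives E(epsilon_i) = 0 and
% E(X_i epsilon_i) = E[ X_i E(epsilon_i | X_i) ] = 0, so
% Cov(X_i, epsilon_i) = E(X_i epsilon_i) - E(X_i) E(epsilon_i) = 0.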
The mean and covariance matrix of the explanatory variables are given by
\begin{displaymath}
E\left( \begin{array}{c} X_{i,1} \\ X_{i,2} \end{array} \right) =
\left( \begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right)
\mbox{~~ and ~~}
Var\left( \begin{array}{c} X_{i,1} \\ X_{i,2} \end{array} \right) =
\left( \begin{array}{rr} \phi_{11} & \phi_{12} \\ \phi_{12} & \phi_{22} \end{array} \right).
\end{displaymath}
The explanatory variables $X_{i,1}$ and $X_{i,2}$ are independent of $\epsilon_i$. Unfortunately $X_{i,2}$, which has an impact on $Y_i$ and is correlated with $X_{i,1}$, is not part of the data set. Since $X_{i,2}$ is not observed, it is absorbed by the intercept and error term, as follows.
\begin{eqnarray*}
Y_i &=& \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \epsilon_i \\
    &=& (\beta_0 + \beta_2\mu_2) + \beta_1 X_{i,1} + (\beta_2 X_{i,2} - \beta_2 \mu_2 + \epsilon_i) \\
    &=& \beta^\prime_0 + \beta_1 X_{i,1} + \epsilon^\prime_i.
\end{eqnarray*}
The primes just denote a new $\beta_0$ and a new $\epsilon_i$. It was necessary to add and subtract $\beta_2 \mu_2$ in order to obtain $E(\epsilon^\prime_i)=0$. And of course there could be more than one omitted variable. They would all get swallowed by the intercept and error term, the garbage bins of regression analysis.
\begin{enumerate}
\item What is $Cov(X_{i,1},\epsilon^\prime_i)$?
\item Calculate the variance-covariance matrix of $(X_{i,1},Y_i)$ under the true model. Is it possible to have non-zero covariance between $X_{i,1}$ and $Y_i$ when $\beta_1=0$?
\item Suppose we want to estimate $\beta_1$. The usual least squares estimator is
\begin{displaymath}
\widehat{\beta}_1 = \frac{\sum_{i=1}^n(X_{i,1}-\overline{X}_1)(Y_i-\overline{Y})}
                         {\sum_{i=1}^n(X_{i,1}-\overline{X}_1)^2}.
\end{displaymath}
You may just use this formula; you don't have to derive it. Is $\widehat{\beta}_1$ a consistent estimator of $\beta_1$ if the true model holds? Answer Yes or No and show your work. You may use the consistency of the sample variance and covariance without proof.
\item Are there \emph{any} points in the parameter space for which $\widehat{\beta}_1 \stackrel{p}{\rightarrow} \beta_1$ when the true model holds?
\end{enumerate}

\item Independently for $i = 1, \ldots, n$, let $Y_i = \beta X_i + \epsilon_i$, where $X_i \sim N(\mu,\sigma^2_x)$ and $\epsilon_i \sim N(0,\sigma^2_\epsilon)$. Because of omitted variables that influence both $X_i$ and $Y_i$, we have $Cov(X_i,\epsilon_i) = c \neq 0$.
\begin{enumerate}
\item The least squares estimator of $\beta$ is $\frac{\sum_{i=1}^n X_iY_i}{\sum_{i=1}^n X_i^2}$. Is this estimator consistent? Answer Yes or No and prove your answer.
\item Give the parameter space for this model. There are some constraints on $c$.
\item First consider points in the parameter space where $\mu \neq 0$. Give an estimator of $\beta$ that converges almost surely to the right answer for that part of the parameter space. If you are not sure how to proceed, try calculating the expected value and covariance matrix of $(X_i,Y_i)$.
\item What happens in the rest of the parameter space --- that is, where $\mu=0$? Is a consistent estimator possible there? So we see that parameters may be identifiable in some parts of the parameter space but not all.
\end{enumerate}

\item We know that omitted explanatory variables are a big problem, because they induce non-zero covariance between the explanatory variables and the error terms $\epsilon_i$.
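% A possible check on the omitted-variable question above (a sketch, not the only route):
% Cov(X_{i,1}, epsilon'_i) = Cov(X_{i,1}, beta2 X_{i,2} + epsilon_i) = beta2 * phi_12, and
% Cov(X_{i,1}, Y_i) = beta1 * phi_11 + beta2 * phi_12, so by consistency of the sample
% variance and covariance, betahat_1 converges in probability to beta1 + beta2*phi_12/phi_11.
% It is consistent only at points where beta2 * phi_12 = 0.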
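% For the model Y_i = beta X_i + epsilon_i with Cov(X_i, epsilon_i) = c, a similar check:
% E(X_i Y_i) = beta(mu^2 + sigma^2_x) + c, so the least squares estimator converges to
% beta + c/(mu^2 + sigma^2_x) rather than beta. When mu is nonzero, E(Y_i) = beta*mu, so
% something like Ybar/Xbar converges almost surely to beta by the strong law of large numbers.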
The residuals have a lot in common with the $\epsilon_i$ terms in a regression model, though they are not the same thing. A reasonable idea is to check for correlation between explanatory variables and the $\epsilon_i$ values by looking at the correlation between the residuals and explanatory variables. Accordingly, for a multiple regression model with an intercept so that $\sum_{i=1}^ne_i=0$, calculate the sample correlation $r$ between explanatory variable $j$ and the residuals $e_1, \ldots, e_n$. Use the formula for $r$ from the formula sheet. Simplify. What can the sample correlations between residuals and $x$ variables tell you about the correlation between $\epsilon$ and the $x$ variables?

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Bayes %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\pagebreak
{\small
\item Let $X_1, \ldots, X_n$ be a random sample from a binomial distribution with parameters 4 and $\theta$, where $\theta$ is unknown. The prior distribution on $\theta$ is beta with parameters $\alpha$ and $\beta$.
\begin{enumerate}
\item Find the posterior density of $\theta$, including the constant that makes it integrate to one.
% I get beta with alpha' = alpha + n*xbar and beta' = beta + n*(4-xbar)
\item For $n=20$ observations and $\alpha=\beta=1$ (the uniform distribution), we obtain $\overline{X} = 2.3$. What is the posterior mean? Hint: The expected value of a beta random variable is $\frac{\alpha}{\alpha+\beta}$.
\end{enumerate}

\item Let $X_1, \ldots, X_n$ be a random sample from a Poisson distribution with parameter $\lambda>0$. The prior on $\lambda$ will be gamma; use the parameterization on the formula sheet. Please derive the posterior density of $\lambda$, including the constant that makes it integrate to one.
% Gamma, alpha' = alpha+n*xbar, beta' = beta+n

\item Let $X_1, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$ and precision $\tau$ (the precision is one over the variance -- see formula sheet).
\begin{enumerate}
\item Suppose that the parameter $\mu$ is known, while $\tau$ is unknown. The prior on $\tau$ is gamma, with the parameterization given on the formula sheet. Give the posterior distribution of $\tau$.
\item Suppose that $\tau$ is known, while this time $\mu$ is unknown. The prior on $\mu$ is standard normal. Find the posterior distribution of $\mu$.
\end{enumerate}

\item Suppose the prior is a finite mixture of prior distributions. That is, the parameter $\theta$ has prior density
\begin{displaymath}
\pi(\theta) = \sum_{j=1}^k a_j \, \pi_j(\theta).
\end{displaymath}
The constants $a_1, \ldots, a_k$ are called \emph{mixing weights}; they are non-negative and they add up to one. Show that the posterior distribution is a mixture of the posterior distributions corresponding to $\pi_1(\theta), \ldots, \pi_k(\theta)$. What are the mixing weights of the posterior? This result can be useful if your model has a conjugate prior family, because you can represent virtually any prior opinion by a mixture of conjugate priors. For example, a bimodal prior might be just a mixture of two normals. Thus, you can have essentially any prior you wish, and also the convenience of an exact posterior distribution.

\item Let $\theta$ represent the probability that a particular type of cancer, apparently wiped out by chemotherapy, will recur within 2 years. Suppose you \emph{really have no prior idea} about the value of $\theta$. Therefore, you adopt a ``non-informative'' uniform prior distribution on the interval from zero to one.
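% Re the residual-correlation question above, one way to see the simplification: the numerator
% of r is sum (x_{ij} - xbar_j)(e_i - ebar) = sum x_{ij} e_i - xbar_j * sum e_i = 0 - 0 = 0
% (the j-th entry of X'e, plus the intercept condition), so r is exactly zero for every
% explanatory variable in the model -- it says nothing about Corr(epsilon, x).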
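% Re the mixture-prior question: writing m_j(x) for the marginal of the data under pi_j,
% m_j(x) = integral of f(x|theta) pi_j(theta) d theta, the posterior should work out to
% sum_j a'_j pi_j(theta|x), with new mixing weights a'_j = a_j m_j(x) / sum_k a_k m_k(x).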
Of course, if you have no idea about the probability, you also have no idea about the log odds. The log odds is given by $\ln\frac{\theta}{1-\theta}$. Derive the density of the log odds if $\theta$ has a uniform distribution, and use R to plot it. Do you seem to have an idea about what the log odds should be?
% I get e^y/(1+e^y)^2, symmetric around zero. The moral of this story is that when you
% adopt a uniform prior, you are still expressing an opinion.
} % end size of last page

\end{enumerate}
\end{document}

% Next time, A6 of 2017 has numerical MLEs.

# Binomial
set.seed(9999); n = 20; theta = 1/2; alpha = 1; beta = 1
x = rbinom(n, 4, theta); xbar = mean(x); xbar                # 2.3
(alpha + n*xbar)/(alpha + n*xbar + beta + n*(4 - xbar))      # posterior mean: 0.5731707
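# Log odds question: if theta ~ Uniform(0,1) and Y = log(theta/(1-theta)), the
# change-of-variables density is exp(y)/(1+exp(y))^2 (standard logistic).
# A quick sketch of the plot:
y = seq(-6, 6, length.out = 501)
plot(y, exp(y)/(1+exp(y))^2, type = 'l', xlab = 'Log odds', ylab = 'Density')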