\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue,
citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
\topmargin=-.3in \textheight=9.4in
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442 Assignment Six}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent Please bring your R printouts to the quiz. The non-computer questions are just practice for the quiz, and are not to be handed in, though you may use R as a calculator. Bring a real calculator to the quiz.
\vspace{5mm}

\noindent Most of this assignment is based on the usual normal linear model
\begin{displaymath}
\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon},
\end{displaymath}
where $\mathbf{X}$ is an $n \times p$ matrix of known constants, $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown constants, and $\boldsymbol{\epsilon}$ is multivariate normal with mean zero and covariance matrix $\sigma^2 \mathbf{I}_n$, with $\sigma^2 > 0$ an unknown constant. You may use facts like these without proof.
\begin{itemize}
\item $\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\prime \mathbf{X})^{-1} \mathbf{X}^\prime \mathbf{Y} \sim N_p\left(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}^\prime \mathbf{X})^{-1}\right)$
\item $SSE/\sigma^2 \sim \chi^2(n-p)$
\item $SSE$ and $\widehat{\boldsymbol{\beta}}$ are independent
\item If $Z\sim N(0,1)$ and $W \sim \chi^2(\nu)$ are independent, then
\begin{displaymath}
T = \frac{Z}{\sqrt{W/\nu}} \sim t(\nu).
\end{displaymath}
\end{itemize}

\begin{enumerate}
\item Show that the $n \times p$ matrix of covariances $C(\mathbf{e},\widehat{\boldsymbol{\beta}}) = \mathbf{0}$.
This is the key to showing that $SSE$ and $\widehat{\boldsymbol{\beta}}$ are independent.

\item Consider the prediction interval for $Y_{n+1}$.
\begin{enumerate}
\item What is the distribution of $Y_{n+1}-\widehat{Y}_{n+1} = Y_{n+1}-\mathbf{x}_{n+1}^\prime \widehat{\boldsymbol{\beta}}$? Show your work. Your answer includes both the expected value and the variance.
\item Now standardize the difference to obtain a standard normal.
\item Divide by the square root of a chi-squared random variable, divided by its degrees of freedom, and simplify. Call it $T$. Compare your answer to a slide from lecture.
\item Using your result, derive the $(1-\alpha)\times 100\%$ prediction interval for $Y_{n+1}$.
\end{enumerate}

\item There is a difference between a prediction interval and a confidence interval. Using the preceding question as a guide, derive a $(1-\alpha)\times 100\%$ confidence interval for $E(Y_{n+1})= \mathbf{x}_{n+1}^\prime \boldsymbol{\beta}$. This time, rather than trapping $Y_{n+1}$ in the interval, you are trapping $\mathbf{x}_{n+1}^\prime \boldsymbol{\beta}$.

\item When you fit a full and a reduced regression model, the proportion of remaining variation explained by the additional variables in the full model is $a = \frac{R^2_F-R^2_R}{1-R^2_R}$. Show
\begin{displaymath}
F = \frac{(SSR_F-SSR_R)/r}{MSE_F} = \left(\frac{n-p}{r}\right) \left(\frac{a}{1-a}\right).
\end{displaymath}

\item In the usual univariate multiple regression model, $\mathbf{X}$ is an $n \times p$ matrix of known constants. But of course in practice, the explanatory variables are random, not fixed. Clearly, if the model holds \emph{conditionally} upon the values of the explanatory variables, then all the usual results hold, again conditionally upon the particular values of the explanatory variables. The probabilities (for example, $p$-values) are conditional probabilities, and the $F$ statistic does not have an $F$ distribution, but a conditional $F$ distribution, given $\mathbf{X=x}$.
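One standard tool for the unconditional calculations in this question is the double expectation (iterated expectation) identity: for any statistic $T$ and any event $A$,
\begin{displaymath}
E(T) = E\left[E(T|\mathbf{X})\right] \mbox{~~~and~~~} P(A) = E\left[P(A|\mathbf{X})\right],
\end{displaymath}
where the outer expected value is taken with respect to the distribution of $\mathbf{X}$.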
\begin{enumerate} \item Show that the least-squares estimator $\widehat{\boldsymbol{\beta}}= (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{Y}$ is conditionally unbiased. \item Show that $\widehat{\boldsymbol{\beta}}$ is also unbiased unconditionally. \item A similar calculation applies to the significance level of a hypothesis test. Let $F$ be the test statistic (say for an $F$-test comparing full and reduced models), and $f_c$ be the critical value. If the null hypothesis is true, then the test is size $\alpha$, conditionally upon the explanatory variable values. That is, $P(F>f_c|\mathbf{X=x})=\alpha$. Find the \emph{unconditional} probability of a Type I error. Assume that the explanatory variables are discrete, so you can write a multiple sum. \end{enumerate} \item \label{randslopes} It is perfectly natural to assume that something like response to a drug might be approximately linear over some range of dosage values, but that each person in the population might have his or her own slope. Thus each time you select a random sample you'll get a different collection of slopes, and the regression coefficient corresponding to the slope would be a random variable. Here is a simple model illustrating this situation. Let \begin{displaymath} Y_i = b_ix_i + \epsilon_i, \end{displaymath} where $x_1, \ldots, x_n$ are known constants, and independently for $i=1, \ldots, n$, \begin{itemize} \item[] $b_i$ is a random variable with expected value $\beta$ and variance $\sigma^2_1$, \item[] $\epsilon_i$ is a random variable with expected value zero and variance $\sigma^2_2$, and \item[] $b_i$ and $\epsilon_i$ are independent. \end{itemize} \begin{enumerate} \item This is a special case of the general mixed linear model (See Regression 1 slides). \begin{enumerate} \item What is the matrix $\mathbf{X}$? What is $p$? \item What is the matrix $\boldsymbol{\beta}$? \item What is the matrix $\mathbf{Z}$? What is $q$? \item What is the matrix $\mathbf{b}$? 
\item What is the matrix $\boldsymbol{\Sigma}_b$?
\end{enumerate}
\item What would happen if you tried to estimate $\beta$ in the usual way with
\begin{displaymath}
\widehat{\beta} = \frac{\sum_{i=1}^n x_i Y_i}{\sum_{i=1}^n x_i^2} \, ?
\end{displaymath}
Under what conditions on the $x_i$ values is this estimator consistent?
\item Find another estimator of $\beta$ by calculating $E(\overline{Y}_n)$. Why does the Law of Large Numbers \emph{not} apply here? Okay, anyway, propose an estimator, and describe the conditions on the $x_i$ values that will make it consistent for $\beta$.
% \item Which estimator do you like more? Why?
\item Now suppose that all the $x_i$ values are equal to one. To make things as easy as possible, assume everything is normally distributed.
\begin{enumerate}
\item What is the distribution of $Y_i$ in this situation?
\item Propose an estimator of $\beta$ that should satisfy anyone.
\item Give an \emph{exact} $(1-\alpha)\times 100\%$ confidence interval for $\beta$; you don't have to show any work.
\item Now suppose that you want to estimate $\sigma^2_1$ and $\sigma^2_2$. Remember the problem from last assignment in which men and women were calling a help line according to independent Poisson processes, and we tried to estimate $\lambda_1$ and $\lambda_2$? Does that problem tell you anything about your chances of success?
\end{enumerate}
This last little example shows you two things. First, whether the parameters of a model can be estimated depends on how you collect the data; this is a matter of experimental design (assuming the $x_i$ values are under the control of the investigator). Second, it is possible that some parameters can be estimated very successfully, while others cannot be estimated at all.
\end{enumerate}
\item For most configurations of $x_1, \ldots, x_n$, the variance parameters in Question~\ref{randslopes} can be estimated successfully --- but it's not so easy to see how. So we'll do it numerically with maximum likelihood.
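As a pointer for setting up the likelihood: assuming everything is normally distributed, the model of Question~\ref{randslopes} implies that the $Y_i$ are independent with
\begin{displaymath}
Y_i \sim N\left(\beta x_i, \; x_i^2 \sigma^2_1 + \sigma^2_2 \right),
\end{displaymath}
since $b_i$ and $\epsilon_i$ are independent normals. The likelihood is therefore a product of normal densities with these means and variances.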
Again, suppose everything is normally distributed. Some data from the model of Question~\ref{randslopes} are available from the class website, in the file \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/randslope.data}{\texttt{randslope.data}}.
\begin{enumerate}
\item Make a scatterplot of the data and bring it to the quiz. Does it look funny? You're guaranteed that the model is correct. \emph{Why} does the scatterplot look the way it does? How would it look if there were also a range of negative $x_i$ values?
\item Estimate the parameters numerically. Your answer to this part is a set of three numbers. Show the definition of the function you're minimizing, as well as all the other input and output leading to your answer. Wondering about starting values? Well, at least you know where to start looking for $\widehat{\beta}$.
\item Using the asymptotic variance of $\widehat{\beta}$, carry out a simple two-sided $Z$-test of $H_0: \beta=0$. Your output should include the computed value of $Z$ and the two-tailed $p$-value. Do you reject $H_0$ at $\alpha = 0.05$?
\end{enumerate}
Bring your printout to the quiz.

\newpage
\item For this question, you will use the file \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/sat.data}{\texttt{sat.data}} from Assignment 5. There is a link on the course web page in case the one in this document does not work. We seek to predict GPA from the two test scores. Throughout, please use the usual $\alpha=0.05$ significance level.
\begin{enumerate}
\item First, fit a model using just the Math score as a predictor. ``Fit'' means estimate the model parameters. Does there appear to be a relationship between Math score and grade point average?
\begin{enumerate}
\item Answer Yes or No.
\item Fill in the blank. Students who did better on the Math test tended to have \underline{~~~~~~~~~~~} first-year grade point average.
\item Do you reject $H_0: \beta_1=0$?
\item Are the results statistically significant?
Answer Yes or No. \item What is the $p$-value? The answer can be found in \emph{two} places on your printout. \item What proportion of the variation in first-year grade point average is explained by score on the SAT Math test? The answer is a number from your printout. \item Give a predicted first-year grade point average and a 95\% prediction interval for a student who got 700 on the Math SAT. \end{enumerate} \item Now fit a model with both the Math and Verbal sub-tests. \begin{enumerate} \item Give the test statistic, the degrees of freedom and the $p$-value for each of the following null hypotheses. The answers are numbers from your printout. \begin{enumerate} \item $H_0: \beta_1=\beta_2=0$ \item $H_0: \beta_1=0$ \item $H_0: \beta_2=0$ \item $H_0: \beta_0=0$ \end{enumerate} \item Controlling for Math score, is Verbal score related to first-year grade point average? \begin{enumerate} \item Give the null hypothesis in symbols. \item Give the value of the test statistic. The answer is a number from your printout. \item Give the $p$-value. The answer is a number from your printout. \item Do you reject the null hypothesis? \item Are the results statistically significant? Answer Yes or No. \item In plain, non-statistical language, what do you conclude? The answer is something about test scores and grade point average. \end{enumerate} \item Controlling for Verbal score, is Math score related to first-year grade point average? \begin{enumerate} \item Give the null hypothesis in symbols. \item Give the value of the test statistic. The answer is a number from your printout. \item Give the $p$-value. The answer is a number from your printout. \item Do you reject the null hypothesis? \item Are the results statistically significant? Answer Yes or No. \item In plain, non-statistical language, what do you conclude? The answer is something about test scores and grade point average. 
\end{enumerate}
\item Math score explains \underline{~~~~~~~} percent of the remaining variation in grade point average once you take Verbal score into account. Using the formula from the slides (which will be provided on the quiz if you need it), you should be able to calculate this from the output of the \texttt{summary} function. Check your answer using the \texttt{anova} function.
\item Verbal score explains \underline{~~~~~~~} percent of the remaining variation in grade point average once you take Math score into account. Using the formula from the slides (which will be provided on the quiz if you need it), you should be able to calculate this from the output of the \texttt{summary} function. Check your answer using the \texttt{anova} function.
\item Give a predicted first-year grade point average and a 95\% prediction interval for a student who got 650 on the Verbal and 700 on the Math SAT. Are you confident that this student's first-year GPA will be above 2.0 (a C average)?
\item Let's do one more test. We want to know whether expected GPA increases faster as a function of the Verbal SAT, or the Math SAT. That is, we want to compare the regression coefficients, testing $H_0: \beta_1=\beta_2$.
\begin{enumerate}
\item Express the null hypothesis in matrix form as $\mathbf{C}\boldsymbol{\beta} = \mathbf{h}$. Obviously, this should be pretty routine.
\item But it's a bit more trouble than you'd think using \texttt{R}. I can think of three ways, all a little clumsy. Do the best you can. Carry out the test, producing an $F$ statistic, degrees of freedom (a pair of numbers) and a $p$-value. Be able to state your conclusion in plain, non-technical language. It's something about first-year grade point average.
% Can't conclude that expected GPA increases at different rates as a function of Verbal SAT and Math SAT.
\end{enumerate}
\end{enumerate}
Bring your printout to the quiz.
\end{enumerate}
\end{enumerate}

\vspace{30mm}

\noindent
\begin{center}\begin{tabular}{l}
\hspace{6.5in} \\
\hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf12}{\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf12}}

\end{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Maybe later
\begin{enumerate}
\item Finally, we'll take a look at the Studentized deleted residuals. Suppose we want to carry out $t$-tests to locate outliers. What is the critical value? The answer is a number that you can get from \texttt{R}.
\item How many Studentized deleted residuals are beyond the critical value? The answer is a number. % 5 out of 200
\item If the model were correct, how many Studentized deleted residuals would you expect to be beyond the critical value just by chance? The answer is a number.
% Later, a good Bonferroni correction question.
\item Make a normal QQ plot of the residuals (Studentized deleted or not). You might as well use the \texttt{qqnorm} and \texttt{qqline} functions for convenience. Do you see evidence of departure from normality? Describe any pattern you see.
\item Plot residuals against variables in the equation. Do you see any need for a quadratic term? Any apparent outliers in 2 dimensions?
% No, Yes
\end{enumerate}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
That random slopes model from Q3

rm(list=ls())
# Simulate the data
set.seed(9999)
nminus10 = 90; beta = 1; sigsqbeta = 4; sigsqeps = 9
x = c(numeric(10),(1:nminus10)/10); n = length(x)
B = rnorm(n,beta,sqrt(sigsqbeta))
epsilon = rnorm(n,0,sqrt(sigsqeps))
Y = x*B + epsilon
Y = round(Y,4)
cor(x,Y) # 0.2701852
plot(x,Y) # Wow.
dat = cbind(x,Y)
# Ugly: write(t(dat),file="randslope.data",ncolumns=2)
dat # Edit this and put it in the file randslope.data

rsmll <- function(theta,datta)
# Random slopes minus log likelihood
# datta must be n x 2
# theta = (beta, sigsqbeta, sigsqepsilon)
{
xx = datta[,1]; yy = datta[,2]
b = theta[1]; vb = theta[2]; ve = theta[3]
rsmll = -sum(dnorm(yy,xx*b,sqrt(xx^2*vb+ve),log=T)) # xx, not the global x
rsmll
} # End of function rsmll

rsmll(c(1,4,9),datta=dat) # 349.4072 at true parameter values
bstart = mean(Y)/mean(x); bstart # 0.9455372
summary(lm(Y~-1+x)) # 1.0045
estart = var(Y[x==0]); estart
mx2 = mean(x^2); my2 = mean(Y^2)
sbstart = (my2 - estart - bstart^2*mx2)/mx2
bstart; estart; sbstart
# Starting values in theta order (beta, sigsqbeta, sigsqeps)
unrest = nlm(rsmll,c(bstart,sbstart,estart),hessian=T,datta=dat)
unrest
Vhat = solve(unrest$hessian); Vhat
Z = unrest$estimate[1]/sqrt(Vhat[1,1]); Z
pval = 2*(1-pnorm(Z)); pval
# rsmll(c(bstart,sbstart,estart),datta=dat) # 348.0576 at starting values
# A few more starting values ...
# Compare estimate of (0.846385 4.965494 5.802456), minimum 347.9452

===========
> Z = unrest$estimate[1]/sqrt(Vhat[1,1]); Z
[1] 3.285027
> pval = 2*(1-pnorm(Z)); pval
[1] 0.001019724
> unrest$estimate
[1] 0.846385 4.965494 5.802456
===========

nlm(rsmll,c(1,4,9),hessian=T,datta=dat) # Truth
nlm(rsmll,c(-10,1,1),hessian=T,datta=dat) # Badly false
# I'm content

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Last Question (SAT)
Data have been fixed: No more 100 bottles of beer.
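A minimal Python cross-check of the random-slopes MLE above, assuming numpy and scipy are available. The data are simulated here with a hypothetical seed rather than read from randslope.data, so the estimates are only expected to land near the true values $\beta=1$, $\sigma^2_1=4$, $\sigma^2_2=9$; this is a sketch, not the course's method.

```python
# Sketch of the random-slopes MLE, mirroring the R function rsmll above.
# The design (10 zeros, then 0.1, 0.2, ..., 9.0) copies the R simulation;
# the seed and optimizer choice are arbitrary assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9999)
x = np.concatenate([np.zeros(10), np.arange(1, 91) / 10.0])
b = rng.normal(1.0, 2.0, size=x.size)    # random slopes: mean beta=1, sd=2
eps = rng.normal(0.0, 3.0, size=x.size)  # errors: sd=3
y = b * x + eps

def neg_loglik(theta):
    """Minus log likelihood: Y_i ~ N(beta*x_i, x_i^2*sigsq_b + sigsq_e)."""
    beta, sigsq_b, sigsq_e = theta
    if sigsq_b <= 0 or sigsq_e <= 0:
        return np.inf                    # keep the variances positive
    var = x ** 2 * sigsq_b + sigsq_e
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (y - beta * x) ** 2 / var)

# Start beta at the least-squares slope through the origin, variances at 1.
start = [float(np.sum(x * y) / np.sum(x * x)), 1.0, 1.0]
fit = minimize(neg_loglik, x0=start, method="Nelder-Mead")
beta_hat, sigsq_b_hat, sigsq_e_hat = fit.x
print(beta_hat, sigsq_b_hat, sigsq_e_hat)
```

As in the R version, the observed-information standard error for $\widehat{\beta}$ could then be pulled from the inverse Hessian for the $Z$-test.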
rm(list=ls())
dat = read.table("http://www.utstat.toronto.edu/~brunner/appliedf12/data/sat.data")
attach(dat)
# a.
modelA = lm(GPA ~ MATH); summary(modelA)
# Prediction interval
math700 = data.frame(MATH=700)
predict(modelA,newdata=math700,interval="prediction")
summary(modelA)$r.squared #$ Ending syntax colouring

#b
modelB = lm(GPA ~ MATH+VERBAL); summary(modelB)
verbal650math700 = data.frame(VERBAL=650,MATH=700)
predict(modelB,newdata=verbal650math700,interval="prediction")
#        fit      lwr      upr
# [1,] 2.805857 1.719307 3.892408

# a = r*F/(n-p + r*F)
# Can get these quantities by visual inspection without knowing the names.
NminusP = summary(modelB)$fstatistic[3] #$ Ending syntax colouring
r = 1 # For t-tests
F_math = summary(modelB)$coefficients[2,3]^2 #$ Ending syntax colouring
a_math = r*F_math/(NminusP + r*F_math)
names(a_math) = NULL; a_math # 0.01348301
F_verbal = summary(modelB)$coefficients[3,3]^2 #$ Ending syntax colouring
a_verbal = r*F_verbal/(NminusP + r*F_verbal)
names(a_verbal) = NULL; a_verbal # 0.08139433

# Now check a_math=0.01348301 and a_verbal=0.08139433
verbalmath = anova(lm(GPA~VERBAL+MATH)); verbalmath
mathverbal = anova(modelB); mathverbal
SSTO = sum(verbalmath[,2]); SSTO
var(GPA) * 199 # Just checking
a_math; verbalmath[2,2]/(SSTO-verbalmath[1,2])
a_verbal; mathverbal[2,2]/(SSTO-mathverbal[1,2])

# Testing equal slopes
sat = VERBAL+MATH # Total score
modelC = lm(GPA ~ sat); summary(modelC) # Restricted model under H0: beta1=beta2
anova(modelC,modelB)
# Can't conclude that expected GPA increases at different rates as a function
# of Verbal SAT and Math SAT.

# Not yet!
# Residual checks
critval = qt(0.975,NminusP-1); critval # For Studentized deleted residuals
stdel = rstudent(modelB)
length(stdel[abs(stdel)>critval]) # 5 Compare 10 expected by chance.
qqnorm(stdel); qqline(stdel)
qqnorm(modelB$residuals); qqline(modelB$residuals)
# Looks like both highest and lowest residuals are lower than expected.
# I suspect it's a crash and burn sub-population at the bottom and a
# ceiling effect at the top (max GPA = 3.9).
plot(VERBAL,stdel) # A couple of good verbal kids and a moderate verbal crashed.
plot(MATH,stdel) # 3 moderate to high math crashed, including the 2.
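The a_math and a_verbal computations above rest on the identity from Question 4, $a = rF/(n-p+rF)$, equivalently $F = \frac{n-p}{r}\cdot\frac{a}{1-a}$. A quick algebraic sanity check in Python; the numbers below are made up for illustration and are not the sat.data output.

```python
# Sanity check of the identity linking the F statistic for the extra
# variables to a, the proportion of remaining variation explained:
#   a = r*F / (n-p + r*F)   <=>   F = ((n-p)/r) * a/(1-a)
# Hypothetical n, p, r, and F below -- not taken from the SAT analysis.

def a_from_F(F, n, p, r):
    """Proportion of remaining variation explained, from the F statistic."""
    return r * F / (n - p + r * F)

def F_from_a(a, n, p, r):
    """The F statistic recovered from a; inverts a_from_F."""
    return ((n - p) / r) * a / (1 - a)

n, p, r = 200, 3, 1   # hypothetical sample size, betas, and restrictions
F = 17.5              # hypothetical F statistic (t^2 when r = 1)
a = a_from_F(F, n, p, r)
print(a)
print(F_from_a(a, n, p, r))  # round trip recovers F
```

With r = 1 the same check applies to the squared t statistics pulled from summary(modelB) in the R code above.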