\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 431s15 Assignment Three}}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/431s15} {\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/431s15}}}
\vspace{1 mm}
\end{center}

\noindent The non-computer questions on this assignment are practice for Term Test One on Monday February 2nd. There will be no SAS on Term Test One. The SAS part of this assignment (Question~\ref{SAS}) is for Quiz Three on Friday February 6th. Please bring your log file and your output file to the quiz. There will be one or more questions about them, and you will be asked to hand them in with the quiz.

\begin{enumerate}

% MOM
\item For each of these problems, $Y_1, \ldots, Y_n$ are a random sample from a distribution with the given density or probability mass function. For each one, obtain the formula for a Method of Moments estimator, and then calculate your formula for the given data. There is more than one estimator for most distributions, but try to find a nice simple one.
\begin{enumerate}
\item $f(x) = \theta x^{\theta-1}$ for $0<x<1$, where $\theta>0$.
Data: \texttt{0.04, 0.69, 0.86, 0.24, 0.99}
% Beta, alpha=theta and beta=1. True theta=1, MLE = -1/mean(log(x)) = 0.965637, MOM is xbar/(1-xbar) = 1.293578
%\item $f(y|\theta) = \theta e^{-\theta y}$ for $y>0$. Data: \texttt{0.54, 0.44, 0.19, 0.34}. % 2.649 Textbook only.
\item $p(x) = \binom{4}{x}\theta^x (1-\theta)^{4-x}$ for $x = 0, \ldots, 4$. Data: \texttt{1, 2, 3, 2, 0, 2}. Use the expected value of a Binomial without proof.
\item $f(x) = \frac{1}{2\theta}$ for $-\theta < x < \theta$, where $\theta>0$. Data: \texttt{-1.43, 1.89, 1.58, -0.62, 0.72, -1.75}. % mean(x^2) = 2.013117
\end{enumerate}

\item Independently for $i=1, \ldots, n$, let $\mathbf{Y}_i = \boldsymbol{\beta}_0 + \boldsymbol{\beta}_1 \mathbf{X}_i + \boldsymbol{\epsilon}_i$, where
\begin{itemize}
\item $\mathbf{Y}_i$ is a $q \times 1$ random vector of observable response variables; there are $q$ response variables.
\item $\mathbf{X}_i$ is a $p \times 1$ observable random vector; there are $p$ explanatory variables. $E(\mathbf{X}_i) = \boldsymbol{\mu}_x$ and $V(\mathbf{X}_i) = \boldsymbol{\Phi}_{p \times p}$. The positive definite matrix $\boldsymbol{\Phi}$ is unknown.
\item $\boldsymbol{\beta}_0$ is a $q \times 1$ matrix of unknown constants.
\item $\boldsymbol{\beta}_1$ is a $q \times p$ matrix of unknown constants.
\item $\boldsymbol{\epsilon}_i$ is a $q \times 1$ random vector with expected value zero and unknown positive definite variance-covariance matrix $V(\boldsymbol{\epsilon}_i) = \boldsymbol{\Psi}_{q \times q}$.
\item $\boldsymbol{\epsilon}_i$ is independent of $\mathbf{X}_i$.
\end{itemize} Letting $\mathbf{D}_i = \left(\begin{array}{c} \mathbf{X}_i \\ \hline \mathbf{Y}_i \end{array} \right)$, we have $V(\mathbf{D}_i) = \boldsymbol{\Sigma} = \left( \begin{array}{c|c} \boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy} \\ \hline \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_y \end{array} \right)$, and $\widehat{\boldsymbol{\Sigma}} = \left( \begin{array}{c|c} \widehat{\boldsymbol{\Sigma}}_x & \widehat{\boldsymbol{\Sigma}}_{xy} \\ \hline \widehat{\boldsymbol{\Sigma}}_{yx} & \widehat{\boldsymbol{\Sigma}}_y \end{array} \right)$. \begin{enumerate} \item Start by writing $\boldsymbol{\Sigma}$ in terms of the unknown parameter matrices. \item Give a Method of Moments Estimator for $\boldsymbol{\Phi}$. Just write it down. \item Obtain formulas for the Method of Moments Estimators of $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_0$ and $\boldsymbol{\Psi}$. Show your work. You may give $\widehat{\boldsymbol{\beta}}_0$ in terms of $\widehat{\boldsymbol{\beta}}_1$, but simplify $\widehat{\boldsymbol{\Psi}}$. \end{enumerate} \vspace{20mm} \pagebreak \item Independently for $i=1, \ldots, n$, let $Y_i = \beta X_i + \epsilon_i$, where \begin{itemize} \item $E(X_i)=\mu_x$, $Var(X_i)=\sigma^2_x$ \item $E(\epsilon_i)=0$, $Var(\epsilon_i)=\sigma^2_\epsilon$ \item $X_i$ and $\epsilon_i$ are independent. \end{itemize} \begin{enumerate} \item Obtain formulas for the Method of Moments Estimators of $\beta$ and $\sigma_\epsilon^2$. Show your work. 
\item Calculate your estimates for the following data:
\begin{verbatim}
x    0.0   1.3   3.2  -2.5  -4.6  -1.6   4.5   3.8
y   -0.8  -1.3   7.4  -5.2  -6.5  -4.9   9.9   7.2
\end{verbatim}
% To make it easier, use these calculations:
\begin{verbatim}
          x      x^2        y      y^2       xy
        0.0     0.00     -0.8     0.64     0.00
        1.3     1.69     -1.3     1.69    -1.69
        3.2    10.24      7.4    54.76    23.68
       -2.5     6.25     -5.2    27.04    13.00
       -4.6    21.16     -6.5    42.25    29.90
       -1.6     2.56     -4.9    24.01     7.84
        4.5    20.25      9.9    98.01    44.55
        3.8    14.44      7.2    51.84    27.36
--------------------------------------------------------
Mean 0.5125  9.57375    0.725    37.53    18.08
\end{verbatim}
Doing $\widehat{\beta}$ two different ways, I got 1.888497 and 1.414634. Doing $\widehat{\sigma}^2_\epsilon$ two different ways, I got 3.385971 and 3.797089. There are other correct answers.
\end{enumerate}

% LRT
\item In the text, the material on maximum likelihood and likelihood ratio tests tends to emphasize numerical MLEs, and is a little more theoretical in places than we are going to be. However, parts may be quite useful. Please start reading on Page 128, and then do Exercises 1 and 2 starting on Page 130. If you did not need this review, please accept my apologies.

\item The formula sheet has a useful expression for the multivariate normal likelihood.
\begin{enumerate}
\item Show that you understand the notation by giving the univariate version, in which $X_1, \ldots, X_n \stackrel{i.i.d}{\sim} N(\mu,\sigma^2)$. Your answer will have no matrix notation for the trace, transpose or inverse.
\item Now starting with the univariate normal density (also on the formula sheet), show that the univariate normal likelihood is the same as your answer to the previous question. Hint: Add and subtract $\overline{X}$.
\item How does this expression allow you to see \emph{without differentiating} that the MLE of $\mu$ is $\overline{X}$? % A.5.2 in the text.
\end{enumerate}

\pagebreak
\item Let $Y_1, \ldots, Y_n$ be a random sample from a distribution with density $f(y) = \frac{1}{\theta} e^{-\frac{y}{\theta}}$ for $y>0$, where the parameter $\theta>0$. We are interested in testing $H_0:\theta=\theta_0$.
\begin{enumerate}
\item What is $\Theta$?
\item What is $\Theta_0$?
\item Derive a general expression for the large-sample likelihood ratio statistic.
\item A sample of size $n=100$ yields $\overline{Y}=1.37$ and $S^2=1.42$. One of these quantities is unnecessary and just provided to irritate you. Well, actually it's a mild substitute for reality, which always provides you with a huge pile of information you don't need. Anyway, we want to test $H_0:\theta=1$. You can do this with a calculator. When I did it a long time ago I got $G^2=11.038$.
\item What is the critical value at $\alpha=0.05$? The answer is a number from the formula sheet.
\item Do you reject $H_0$? Answer Yes or No.
\item Is there evidence that $\theta \neq 1$? Answer Yes or No.
\end{enumerate}

\item The label on the peanut butter jar says peanuts, partially hydrogenated peanut oil, salt and sugar. But we all know there is other stuff in there too. In the United States, the Food and Drug Administration requires that a shipment of peanut butter be rejected if it contains an average of more than 8 rat hairs per pound (well, I'm not sure if it's exactly 8, but let's pretend). There is very good reason to assume that the number of rat hairs per pound has a Poisson distribution with mean $\lambda$, because it's easy to justify a Poisson process model for how the hairs get into the jars. We will test $H_0:\lambda=\lambda_0$.
\begin{enumerate}
\item What is $\Theta$?
\item What is $\Theta_0$?
\item Derive a general expression for the large-sample likelihood ratio statistic.
\item We sample 100 1-pound jars, and observe a sample mean of $\overline{Y}= 8.57$. Should we reject the shipment? We want to test $H_0:\lambda=8$. What is the value of $G^2$?
You can do this with a calculator. When I did it a long time ago I got $G^2=3.97$.
\item Do you reject $H_0$ at $\alpha=0.05$? Answer Yes or No.
\item Do you reject the shipment of peanut butter? Answer Yes or No.
\end{enumerate}

\item Let $X_1, \ldots, X_n \stackrel{i.i.d}{\sim} N(\mu,\sigma^2)$.
\begin{enumerate}
\item Derive a general expression for the large-sample likelihood ratio test statistic for testing $H_0: \sigma^2 = \sigma^2_0$ versus $\sigma^2 \neq \sigma^2_0$.
\item A random sample of size $n=50$ yields $\overline{X}=9.91$ and $\widehat{\sigma}^2 = 0.92$. Calculate the test statistic. Your answer is a number.
\item What is the critical value at $\alpha=0.05$? The answer is a number from the formula sheet.
\item Do you reject $H_0: \sigma^2=1$ at $\alpha = 0.05$?
\item What, if anything, do you conclude?
\end{enumerate}

\pagebreak
\item You might want to look again at the coffee taste test example from lecture before starting this question. An email spam company designs $k$ different emails, and randomly assigns email addresses (from a huge list they bought somewhere) to receive the different email messages. So, this is a true experiment, in which the message a person receives is the experimental treatment. $n_1$ email addresses receive message 1, $n_2$ email addresses receive message 2, \ldots, and $n_k$ email addresses receive message $k$. The response variable is whether the recipient clicks on the link in the email message: $Y_{ij}=1$ if recipient $i$ in Treatment $j$ clicks on the link, and zero otherwise. According to our model, all these observations are independent, with $P(Y_{ij}=1) = \theta_j$ for $i = 1, \ldots, n_j$ and $j = 1, \ldots, k$. We want to know if there are any differences in the effectiveness of the treatments.
\begin{enumerate}
\item What is $H_0$?
\item What is $\Theta$?
\item What is $\Theta_0$?
\item Write the likelihood function.
\item What is $\widehat{\boldsymbol{\theta}}$? If you think about it you can write down the answer without doing any work.
\item What is $\widehat{\boldsymbol{\theta}}_0$? If you think about it you can write down the answer without doing any work. \item Write down and simplify a general expression for the large-sample likelihood ratio statistic $G^2$. What are the degrees of freedom? \item Comparing three spam messages with $n_1=n_2=n_3=1,000$, the company obtains $\overline{Y}_1=0.044$, $\overline{Y}_2=0.050$ and $\overline{Y}_3=0.061$. \item What is the test statistic $G^2$? The answer is a number. \item What is the critical value at $\alpha = 0.05$? The answer is a number from the formula sheet. \item Do you reject $H_0$? Answer Yes or No. \item Is there evidence that the messages differ in their effectiveness? Answer Yes or No. \end{enumerate} \item You may think of this as a continuation of Question 2 of Assignment 2. Let $Y_i = \beta x_i + \epsilon_i$ for $i=1, \ldots, n$, where $\epsilon_1, \ldots, \epsilon_n$ are a random sample from a normal distribution with expected value zero and variance $\sigma^2$. The parameters $\beta$ and $\sigma^2$ are unknown constants. The numbers $x_1, \ldots, x_n$ are known, observed constants. \begin{enumerate} \item What is $\Theta$? \item If the null hypothesis is $H_0: \beta=\beta_0$, what is $\Theta_0$? \item What is $\widehat{\beta}$? Just use your answer from Assignment 2 (but be able to do it again.) \item What is $\widehat{\sigma}^2$? Again just use your answer from Assignment 2, but be able to do it again. \item What is $\widehat{\sigma}^2_0$? \item Show $G^2 = n\ln\frac{\sum_{i=1}^n(Y_i-\beta_0x_i)^2}{\sum_{i=1}^n(Y_i-\widehat{\beta}x_i)^2}$ \end{enumerate} \item Let $\mathbf{D}_1, \ldots, \mathbf{D}_n$ be a random sample from a multivariate normal population with mean $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$. 
Write $\mathbf{D}_i = \left(\begin{array}{c} \mathbf{X}_i \\ \hline \mathbf{Y}_i \end{array} \right)$, where $\mathbf{X}_i$ is $q \times 1$, $\mathbf{Y}_i$ is $r \times 1$ and $p = q+r$. Then $V(\mathbf{D}_i) = \boldsymbol{\Sigma} = \left( \begin{array}{c|c} \boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy} \\ \hline \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_y \end{array} \right)$, and $\widehat{\boldsymbol{\Sigma}} = \left( \begin{array}{c|c} \widehat{\boldsymbol{\Sigma}}_x & \widehat{\boldsymbol{\Sigma}}_{xy} \\ \hline \widehat{\boldsymbol{\Sigma}}_{yx} & \widehat{\boldsymbol{\Sigma}}_y \end{array} \right)$.
\begin{enumerate}
\item Calculate and simplify the large-sample likelihood ratio statistic $G^2$ for testing $H_0: \boldsymbol{\Sigma}_{xy} = \mathbf{0}$, which is equivalent to $\mathbf{X}_i$ and $\mathbf{Y}_i$ being independent. Start with the likelihood and MLEs on the formula sheet. Your answer is a formula. What are the degrees of freedom?
% G^2 = n (log|SigmaHatX| + log|SigmaHatY| - log|SigmaHat|), df = qr
\item For the Twins Data, $\mathbf{X}_i$ could be the vector of three mental measurements and $\mathbf{Y}_i$ could be the vector of six physical measurements. For $n=74$, I calculated $\ln|\widehat{\boldsymbol{\Sigma}}| = 40.814949$, $\ln|\widehat{\boldsymbol{\Sigma}}_x| = 14.913525$ and $\ln|\widehat{\boldsymbol{\Sigma}}_y| = 26.33133$.
\begin{enumerate}
\item Calculate $G^2$ for these data. Your answer is a number.
% \item What are the degrees of freedom? Your answer is a number.
\item SAS \texttt{proc calis} gave us Chi-square = 31.3831 for this problem (see lecture notes). Calculate $\frac{n-1}{n}G^2$ to get this number.
\end{enumerate}
% Testing difference between covariance matrices for independent groups must be solved, but it's not just an exercise, at least not for me.
\end{enumerate}

%{\small
\item \label{SAS} In the United States, admission to university is based partly on high school marks and recommendations, and partly on applicants' performance on a standardized multiple choice test called the Scholastic Aptitude Test (SAT). The SAT has two sub-tests, Verbal and Math. A university administrator selected a random sample of 200 applicants, and recorded the Verbal SAT, the Math SAT and first-year university Grade Point Average (GPA) for each student. The data are given in the file \href{http://www.utstat.toronto.edu/~brunner/data/legal/openSAT.data.txt} {\texttt{openSAT.data.txt}}. Your job is to test whether the variance of the Verbal SAT scores is different from the variance of the Math SAT scores. Start by estimating the variances; I did it with \texttt{proc corr}. Produce a large-sample Chi-squared test. Either $G^2$ or $\frac{n-1}{n}G^2$ is okay. On your printout, you should be able to locate
\begin{itemize}
\item Unrestricted \emph{maximum likelihood} estimates of the two variances (meaning divide by $n$, not $n-1$), allowing them to be unequal. These are numbers.
\item The value of the test statistic (a number).
\item The degrees of freedom (a number).
\item The $p$-value (a number or range of numbers).
\end{itemize}
If $H_0$ is rejected at the $\alpha = 0.05$ significance level, you should be able to state a conclusion like ``The variance of the Verbal SAT scores is greater," or ``The variance of the Math SAT scores is greater."

Bring your log file and output file to the quiz. You may be asked for numbers from your printouts, and you will definitely be asked to hand them in. For full marks, \textbf{there must be no warnings, error messages or notes about missing data on your log file.}

Please follow these guidelines. Marks will be deducted if you do not.
\begin{itemize}
\item Put your name and student number in a \texttt{title} or \texttt{title2} statement.
\item Do not write anything on your printouts in advance of the quiz.
\item Bring your log file to the quiz, \emph{not} just a listing of the program file.
\item The log file and the output file must be from the same run of SAS.
\item Your output file must have a date stamp. This is automatically generated if you save a pdf file or print from SAS Studio.
\item You must use \emph{your} installation of SAS, not the installation on someone else's computer.
\end{itemize}
% } % End size

\end{enumerate}
\end{document}

R work for the regression Method of Moments question:

> x = c(0.0, 1.3, 3.2, -2.5, -4.6, -1.6, 4.5, 3.8)
> y = c(-0.8, -1.3, 7.4, -5.2, -6.5, -4.9, 9.9, 7.2)
> xsq = x^2; ysq = y^2; xy = x*y
> w = cbind(x,xsq,y,ysq,xy); w
        x   xsq    y   ysq    xy
[1,]  0.0  0.00 -0.8  0.64  0.00
[2,]  1.3  1.69 -1.3  1.69 -1.69
[3,]  3.2 10.24  7.4 54.76 23.68
[4,] -2.5  6.25 -5.2 27.04 13.00
[5,] -4.6 21.16 -6.5 42.25 29.90
[6,] -1.6  2.56 -4.9 24.01  7.84
[7,]  4.5 20.25  9.9 98.01 44.55
[8,]  3.8 14.44  7.2 51.84 27.36
> apply(w,2,mean)
       x      xsq        y      ysq       xy
 0.51250  9.57375  0.72500 37.53000 18.08000
> betahat = mean(xy)/mean(xsq); betahat
[1] 1.888497
> n=8
> sigmasqhatepsilon1 = mean(ysq) - mean(xy)^2/mean(xsq)
> sigmasqhatepsilon1
[1] 3.385971
> sx = var(x)*(n-1)/n; sx
[1] 9.311094
> sy = var(y)*(n-1)/n; sy
[1] 37.00438
> sigmasqhatepsilon2 = sy - betahat^2 * sx; sigmasqhatepsilon2
[1] 3.797089

SAS work, just to get the determinants:

/********************** HWtwincov.sas ****************************/
title 'Work for non-computer part of HW3: Twins Data';
%include '/folders/myfolders/431s15/twinread.sas';

proc corr nomiss nosimple nocorr cov vardef=n out=S;
     /* Exclude observations with any missing values */
     title3 'Full covariance matrix';
     var progmat reason verbal headlng headbrd headcir
         bizyg height weight;

proc print;
     title3 'Look at the output data set';

proc iml;
     title3 'Compute the test statistic for Body-Mind';
     use S;
     read all var {progmat reason verbal headlng headbrd headcir
                   bizyg height weight}
          where(_type_ = 'COV') into SigmaHat;
     /* Will need n also */
     read all var {progmat} where(_type_ = 'N') into n;
     /* Extract sub-matrices */
     SigmaHatX = SigmaHat[1:3,1:3];
     SigmaHatY = SigmaHat[4:9,4:9];
     print SigmaHat; print SigmaHatX; print SigmaHatY; print n;
     LogDet  = log(det(SigmaHat));
     LogDetX = log(det(SigmaHatX));
     LogDetY = log(det(SigmaHatY));
     print LogDet LogDetX LogDetY;
     print "Try to match proc calis chisquare = 31.38";
     G2 = n * (LogDetX + LogDetY - LogDet);
     print G2;
     Chisquare = (n-1)/n * G2;
     Pvalue = 1 - probchi(Chisquare,18);
     print G2 Chisquare Pvalue;
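
A quick numeric check of the two "calculator" likelihood ratio statistics quoted above (the exponential test with n = 100, Ybar = 1.37, theta0 = 1, and the Poisson rat-hair test with n = 100, Ybar = 8.57, lambda0 = 8). This is a sketch, not part of the assignment, and it is in Python rather than R or SAS; the G^2 formulas are the standard ones that the questions ask students to derive, written out after substituting the MLE Ybar.

```python
from math import log

# Exponential with mean theta: L(theta) = theta^(-n) * exp(-n*ybar/theta),
# MLE is ybar, so
# G^2 = -2 log( L(theta0)/L(ybar) ) = 2n*( log(theta0) + ybar/theta0 - log(ybar) - 1 ).
def G2_exponential(n, ybar, theta0):
    return 2 * n * (log(theta0) + ybar / theta0 - log(ybar) - 1)

# Poisson with mean lambda: MLE is ybar, so
# G^2 = 2n*( lambda0 - ybar + ybar*log(ybar/lambda0) ).
def G2_poisson(n, ybar, lambda0):
    return 2 * n * (lambda0 - ybar + ybar * log(ybar / lambda0))

print(round(G2_exponential(100, 1.37, 1.0), 3))  # 11.038, matching the assignment
print(round(G2_poisson(100, 8.57, 8.0), 2))      # 3.97, matching the assignment
```

Both statistics are compared to the chi-squared critical value with one degree of freedom.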
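
Similar check work for the Twins question and the spam question, again as a Python sketch rather than part of the assignment. The Twins computation just re-does the IML arithmetic above from the quoted log determinants; the spam computation uses the likelihood ratio statistic for k independent binomial proportions against the pooled estimate under H0 (the statistic the question asks students to derive), with df = k - 1.

```python
from math import log

# Twins Data: G^2 = n*( ln|SigmaHat_x| + ln|SigmaHat_y| - ln|SigmaHat| ),
# using the log determinants quoted in the question, n = 74.
n = 74
G2_twins = n * (14.913525 + 26.33133 - 40.814949)
chisq = (n - 1) / n * G2_twins
print(round(chisq, 4))  # 31.3831, matching proc calis

# Spam experiment: H0 is theta_1 = ... = theta_k. With pbar_j the sample
# proportion in treatment j and p0 the pooled proportion,
# G^2 = 2 * sum_j n_j*( pbar_j*log(pbar_j/p0)
#                       + (1-pbar_j)*log((1-pbar_j)/(1-p0)) ), df = k-1.
def G2_binomial(ns, pbars):
    p0 = sum(n_j * p for n_j, p in zip(ns, pbars)) / sum(ns)
    return 2 * sum(n_j * (p * log(p / p0) + (1 - p) * log((1 - p) / (1 - p0)))
                   for n_j, p in zip(ns, pbars))

print(round(G2_binomial([1000, 1000, 1000], [0.044, 0.050, 0.061]), 2))  # about 3.0
```

With df = 2 the critical value is 5.99, so the spam statistic should be compared against that.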