% 431Assignment5.tex Large sample LR tests, measurement error
\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue,
citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 431s17 Assignment Five}}\footnote{This assignment was
prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner},
Department of Statistical Sciences, University of Toronto. It is licensed
under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any
part of it as you like and share the result freely. The \LaTeX~source code
is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/431s17}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/431s17}}}
\vspace{1 mm}
\end{center}

\noindent This assignment is on large-sample likelihood ratio tests and the
first part of measurement error. The material on likelihood ratio tests is
in lecture slide set 8. See also Section A.6.5 in Appendix A, especially
pages 176--178. The material on measurement error is in lecture slide set 9.
See also Section 0.7 in Chapter Zero, pages 33--38.

\begin{enumerate}
%%%%%%%%%%%%%%%%%%%%%%%% LR Tests %%%%%%%%%%%%%%%%%%%%%%%%
% This is a substantially improved version of the 2015 questions.

\item Let $Y_1, \ldots, Y_n$ be a random sample from a distribution with
density $f(y) = \frac{1}{\theta} e^{-\frac{y}{\theta}}$ for $y>0$, where the
parameter $\theta>0$. We are interested in testing $H_0:\theta=\theta_0$.
% Making it easier for this course,
This is an exponential distribution and the MLE is $\overline{Y}$. You don't
have to re-derive it.
\begin{enumerate}
\item What is the parameter space $\Theta$?
\item What is the restricted parameter space $\Theta_0$?
\item Calculate a formula for the log likelihood evaluated at the
unrestricted MLE. Simplify.
\item Derive a general expression for the large-sample likelihood ratio
statistic $G^2$.
\item What is the distribution of the test statistic under the null
hypothesis? Don't forget the degrees of freedom.
\item A sample of size $n=100$ yields $\overline{Y}=1.37$ and $S^2=1.42$.
One of these quantities is unnecessary and just provided to irritate you.
Well, actually it's a mild substitute for reality, which always provides you
with a huge pile of information you don't need. Anyway, we want to test
$H_0:\theta=1$. You can do this with a calculator. When I did it a long time
ago I got $G^2=11.038$.
\item What is the critical value at $\alpha = 0.05$? The answer is a number
from the formula sheet.
\item Do you reject $H_0$ at $\alpha=0.05$? Answer Yes or No.
\item Are the results statistically significant? Answer Yes or No.
\item Is there evidence that $\theta \neq 1$? Answer Yes or No.
\item Choose one of these conclusions: $\theta<1$, $\theta=1$ or $\theta>1$.
\end{enumerate}
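% A check on the exponential problem, using only the quantities given above.
% Evaluating the log likelihood at the unrestricted and restricted MLEs gives
%   G^2 = 2( -n ln(Ybar) - n + n ln(theta_0) + n Ybar/theta_0 )
%       = 2n( Ybar/theta_0 - 1 - ln(Ybar/theta_0) ).
% With n = 100, Ybar = 1.37 and theta_0 = 1, this is
%   G^2 = 200( 0.37 - ln(1.37) ) = 200( 0.37 - 0.314811 ) = 11.038,
% matching the number quoted in the question. Compare the chi-squared
% critical value 3.841 with df = 1.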
\item The label on the peanut butter jar says peanuts, partially hydrogenated
peanut oil, salt and sugar. But we all know there is other stuff in there
too. In the United States, the Food and Drug Administration requires that a
shipment of peanut butter be rejected if it contains an average of more than
8 rat hairs per pound (well, I'm not sure if it's exactly 8, but let's
pretend). There is very good reason to assume that the number of rat hairs
per pound has a Poisson distribution with mean $\lambda$, because it's easy
to justify a Poisson process model for how the hairs get into the jars. The
Poisson probability mass function is
$p(y) = \frac{e^{-\lambda}\lambda^y}{y!}$ for $y = 0, 1, \ldots$, where
$\lambda>0$. The MLE is the sample mean; you don't have to re-derive it. We
will test $H_0:\lambda=\lambda_0$.
\begin{enumerate}
\item What is the parameter space $\Theta$?
\item What is the restricted parameter space $\Theta_0$?
\item Calculate a formula for the log likelihood evaluated at the
unrestricted MLE. Simplify.
\item Derive a general expression for the large-sample likelihood ratio
statistic $G^2$.
\item What is the distribution of the test statistic under the null
hypothesis? Don't forget the degrees of freedom.
\item We sample 100 one-pound jars, and observe a sample mean of
$\overline{Y}= 8.57$ rat hairs. Should we reject the shipment? We want to
test $H_0:\lambda=8$. What is the value of $G^2$? The answer is a number.
You can do this with a calculator. When I did it a long time ago I got
$G^2=3.97$.
\item Do you reject $H_0$ at $\alpha=0.05$? Answer Yes or No.
\item Do you reject the shipment of peanut butter? Answer Yes or No.
\end{enumerate}

\item Let $X_1, \ldots, X_n \stackrel{i.i.d.}{\sim} N(\mu,\sigma^2)$. We
want to test $H_0: \sigma^2 = \sigma^2_0$ versus
$H_1: \sigma^2 \neq \sigma^2_0$. You should know the unrestricted MLE for
the normal model; you don't have to re-derive it.
\begin{enumerate}
\item What is the parameter space $\Theta$?
\item What is the restricted parameter space $\Theta_0$?
\item Calculate a formula for the log likelihood evaluated at the
unrestricted MLE. Simplify.
\item Derive a general expression for the large-sample likelihood ratio
statistic $G^2$.
\item What is the distribution of the test statistic under the null
hypothesis? Don't forget the degrees of freedom.
\item A random sample of size $n=50$ yields $\overline{X}=9.91$ and
$\widehat{\sigma}^2 = 0.92$. Calculate the test statistic; the answer is a
number.
\item What is the critical value of the test statistic at $\alpha=0.05$?
\item Do you reject $H_0: \sigma^2=1$ at $\alpha = 0.05$? Answer Yes or No.
\item Are the results statistically significant? Answer Yes or No.
\item What, if anything, do you conclude?
\end{enumerate}
% \pagebreak
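% Checks on the two problems above, again using only the quantities given.
% Poisson:
%   G^2 = 2n( Ybar ln(Ybar/lambda_0) - (Ybar - lambda_0) )
%       = 200( 8.57 ln(8.57/8) - 0.57 ) = 3.97, as quoted.
% Normal variance:
%   G^2 = n( sigmahat^2/sigma_0^2 - 1 - ln(sigmahat^2/sigma_0^2) )
%       = 50( 0.92 - 1 - ln(0.92) ) = 50( 0.0033816 ) = 0.169,
% far below the chi-squared critical value of 3.841 with df = 1.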
\item You might want to look again at the coffee taste test example from
lecture Unit 5 (Estimation) before starting this question. An email spam
company designs $k$ different emails, and randomly assigns email addresses
(from a huge list they bought somewhere) to receive the different email
messages. So, this is a true experiment, in which the message a person
receives is the experimental treatment. $n_1$ email addresses receive
message 1, $n_2$ email addresses receive message 2, \ldots, and $n_k$ email
addresses receive message $k$. The response variable is whether the
recipient clicks on the link in the email message: $Y_{ij}=1$ if recipient
$i$ in Treatment $j$ clicks on the link, and zero otherwise. According to
our model, all these observations are independent, with
$P(Y_{ij}=1) = \theta_j$ for $i = 1, \ldots, n_j$ and $j = 1, \ldots, k$. We
want to know if there are any differences in the effectiveness of the
treatments.
\begin{enumerate}
\item What is the parameter space $\Theta$?
\item What is the null hypothesis $H_0$?
\item What is the restricted parameter space $\Theta_0$?
\item Write the log likelihood function and simplify.
\item What is $\widehat{\boldsymbol{\theta}}$? If you think about it you can
write down the answer without doing any work.
\item What is $\widehat{\boldsymbol{\theta}}_0$? If you think about it you
can write down the answer without doing any work.
\item Write down and simplify a general expression for the large-sample
likelihood ratio statistic $G^2$. What are the degrees of freedom?
\item Comparing three spam messages with $n_1=n_2=n_3=1{,}000$, the company
obtains $\overline{Y}_1=0.044$, $\overline{Y}_2=0.050$ and
$\overline{Y}_3=0.061$. What is the test statistic $G^2$? The answer is a
number.
\item What is the critical value at $\alpha = 0.05$? The answer is a number
from the formula sheet.
\item Do you reject $H_0$ at $\alpha=0.05$? Answer Yes or No.
\item Are the results statistically significant? Answer Yes or No.
\item Is there evidence that the messages differ in their effectiveness?
Answer Yes or No.
\end{enumerate}
% \pagebreak

\item You may think of this as a continuation of Question 5 of Assignment 2.
Let $Y_i = \beta x_i + \epsilon_i$ for $i=1, \ldots, n$, where
$\epsilon_1, \ldots, \epsilon_n$ are a random sample from a normal
distribution with expected value zero and variance $\sigma^2$. The
parameters $\beta$ and $\sigma^2$ are unknown constants. The numbers
$x_1, \ldots, x_n$ are known, observed constants.
\begin{enumerate}
\item What is the parameter space $\Theta$?
\item If the null hypothesis is $H_0: \beta=\beta_0$, what is $\Theta_0$?
\item What is $\widehat{\beta}$? Just use your answer from Assignment 2.
\item What is $\widehat{\sigma}^2$? Again, just use your answer from
Assignment 2.
\item What is the restricted MLE of $\beta$?
\item What is $\widehat{\sigma}^2_0$? Show your work.
\item Show
$G^2 = n\ln\frac{\sum_{i=1}^n(Y_i-\beta_0x_i)^2}
{\sum_{i=1}^n(Y_i-\widehat{\beta}x_i)^2}$.
\end{enumerate}
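% Checks on the two problems above. For the spam example, the restricted MLE
% is the pooled proportion thetabar = (44 + 50 + 61)/3000 = 0.0516667, and
%   G^2 = 2 sum_j n_j( Ybar_j ln(Ybar_j/thetabar)
%                      + (1 - Ybar_j) ln((1 - Ybar_j)/(1 - thetabar)) ).
% My arithmetic (worth re-checking) gives G^2 = 3.00, with df = k - 1 = 2
% and critical value 5.99. For the regression problem,
% sigmahat_0^2 = (1/n) sum_i (Y_i - beta_0 x_i)^2, and then
%   G^2 = 2( l(betahat, sigmahat^2) - l(beta_0, sigmahat_0^2) )
%       = n ln( sigmahat_0^2 / sigmahat^2 ),
% which is the displayed formula.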
\item Let $\mathbf{D}_1, \ldots, \mathbf{D}_n$ be a random sample from a
multivariate normal population with mean $\boldsymbol{\mu}$ and
variance-covariance matrix $\boldsymbol{\Sigma}$. Write
$\mathbf{D}_i = \left(\begin{array}{c} \mathbf{X}_i \\ \hline \mathbf{Y}_i
\end{array} \right)$, where $\mathbf{X}_i$ is $q \times 1$, $\mathbf{Y}_i$
is $r \times 1$, and $p = q+r$. Then
$cov(\mathbf{D}_i) = \boldsymbol{\Sigma} =
\left( \begin{array}{c|c} \boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy} \\
\hline \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_y \end{array} \right)$,
and the corresponding estimate is
$\widehat{\boldsymbol{\Sigma}} = \left( \begin{array}{c|c}
\widehat{\boldsymbol{\Sigma}}_x & \widehat{\boldsymbol{\Sigma}}_{xy} \\
\hline \widehat{\boldsymbol{\Sigma}}_{yx} & \widehat{\boldsymbol{\Sigma}}_y
\end{array} \right)$. We want to test whether the vector of observations
$\mathbf{X}_i$ is independent of the vector of observations $\mathbf{Y}_i$.
Because zero covariance implies independence for the multivariate normal,
the null hypothesis is $H_0: \boldsymbol{\Sigma}_{xy} = \mathbf{0}$.
\begin{enumerate}
\item Starting from the formula sheet, write down and simplify the log
likelihood evaluated at the unrestricted MLE. Your answer is a formula. It's
in the lecture slides if you want to check your answer.
\item Using the fact that zero covariance implies independence for the
multivariate normal, give the restricted MLE
$(\widehat{\boldsymbol{\mu}}_0, \widehat{\boldsymbol{\Sigma}}_0)$. If you
think about it you can just write down the answer without any calculation.
\item Give the log likelihood evaluated at the restricted MLE. It's easy if
you use the fact that zero covariance implies independence for the
multivariate normal; otherwise, you're essentially re-proving this fact.
\item Calculate and simplify the large-sample likelihood ratio statistic
$G^2$ for testing $H_0: \boldsymbol{\Sigma}_{xy} = \mathbf{0}$, which is
equivalent to $\mathbf{X}_i$ and $\mathbf{Y}_i$ independent. Start with the
likelihood and MLEs on the formula sheet. Your answer is a formula. What are
the degrees of freedom?
% G^2 = n (log|SigmaHatX| + log|SigmaHatY| - log|SigmaHat|), df = qr
% \pagebreak
\item For example, $\mathbf{X}_i$ could be the vector of three ``mental
measurements,'' namely scores on standardized tests of things like
vocabulary and the ability to solve puzzles. $\mathbf{Y}_i$ could be a
vector of six physical measurements (head circumference, etc.). For $n=74$,
I calculated $\ln|\widehat{\boldsymbol{\Sigma}}| = 40.814949$,
$\ln|\widehat{\boldsymbol{\Sigma}}_x| = 14.913525$ and
$\ln|\widehat{\boldsymbol{\Sigma}}_y| = 26.33133$.
\begin{enumerate}
\item Calculate $G^2$ for these data. Your answer is a number. My answer is
also a number: ??.81304.
% n (log|SigmaHatX| + log|SigmaHatY| - log|SigmaHat|) = 31.81304
\item What are the degrees of freedom? Your answer is a number.
\item The critical value at $\alpha=0.05$ is not on the formula sheet. It's
28.8693. Do you reject $H_0$? Are the mental and physical characteristics
independent?
\end{enumerate}
\end{enumerate}
% Testing the difference between covariance matrices for independent groups
% is too subtle for HW. The hard part is that under H_0 the data are still
% not i.i.d. Looking at the trace form of the likelihood, you can see that
% the last part still disappears. Going back to the original form, see that
% we should substitute x-bar for mu1 and y-bar for mu2. Centering the sample
% data by subtracting off x-bar from x and y-bar from y, it's fairly clear
% that the MLE of the common Sigma is the sample covariance matrix of these
% centered data. Notation is a problem for me, though I see it and I could
% compute it.

%%%%%%%%%%%%%%%%%%%%%%%% Measurement Error %%%%%%%%%%%%%%%%%%%%%%%%

\item\label{measurementbias} In a study of diet and health, suppose we want
to know how much snack food each person eats, and we ``measure'' it by
asking a question on a questionnaire. Surely there will be measurement
error, and suppose it is of a simple additive nature. But we are pretty sure
people under-report how much snack food they eat, so a model
like~$W = X + e$ with $E(e)=0$ is hard to defend. Instead, let
\begin{displaymath}
W = \nu + X + e,
\end{displaymath}
where $E(X)=\mu_x$, $E(e)= 0$, $Var(X)=\sigma^2_x$, $Var(e)=\sigma^2_e$, and
$Cov(X,e)=0$. The unknown constant $\nu$ could be called \emph{measurement
bias}. Calculate the reliability of $W$ for this model. Is it the same as
the expression for reliability given in the text and lecture, or does
$\nu\neq 0$ make a difference?
% Lesson: Assuming expected values and intercepts zero does no harm.
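% Check on the problem above: nu shifts the mean, E(W) = nu + mu_x, but
% contributes nothing to the variance, so Var(W) = sigma^2_x + sigma^2_e and
% the reliability of W is still sigma^2_x/(sigma^2_x + sigma^2_e), the same
% expression as in the text and lecture.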
\item Continuing Exercise~\ref{measurementbias}, suppose that two
measurements of $X$ are available.
\begin{eqnarray}
W_1 & = & \nu_1 + X + e_1 \nonumber \\
W_2 & = & \nu_2 + X + e_2, \nonumber
\end{eqnarray}
where $E(X)=\mu_x$, $Var(X)=\sigma^2_x$, $E(e_1)=E(e_2)=0$,
$Var(e_1)=Var(e_2)=\sigma^2_e$, and $X$, $e_1$ and $e_2$ are all
independent. Calculate $Corr(W_1,W_2)$. Does this correlation still equal
the reliability even when $\nu_1$ and $\nu_2$ are non-zero and potentially
different from one another?
% Yes. Intercepts don't matter.

\item\label{goldstandard} Let $X$ be a latent variable, $W = X + e_1$ be the
usual measurement of $X$ with error, and $G = X+e_2$ be a measurement of $X$
that is deemed ``gold standard,'' but of course it's not completely free of
measurement error. It's better than $W$ in the sense that
$0 < Var(e_2) < Var(e_1)$, where $X$, $e_1$ and $e_2$ are independent with
$E(e_1)=E(e_2)=0$. A natural way to check the quality of $W$ is to compute
its squared correlation with the gold standard. Show that the squared
correlation between $W$ and $G$ is strictly \emph{less} than the reliability
of $W$. This means that in practice, checking a measurement against an
imperfect gold standard will result in under-estimates of reliability.

\item Omitted variables can affect measurement error too. Suppose the errors
in two measurements of $X$ have a common, unmeasured source, so that they
are positively correlated. That is, let $W_1 = X + e_1$ and $W_2 = X + e_2$,
where $E(X)=\mu_x$, $Var(X)=\sigma^2_x$, $E(e_1)=E(e_2)=0$,
$Var(e_1)=Var(e_2)=\sigma^2_e$, $X$ is independent of $e_1$ and $e_2$, and
$Cov(e_1,e_2)=c>0$.
\begin{enumerate}
\item Draw a path diagram of the model.
\item Show that $Corr(W_1,W_2)$ is strictly \emph{greater} than the
reliability. This means that in practice, omitted variables will result in
over-estimates of reliability. There are almost always omitted variables.
\end{enumerate}

\end{enumerate}
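% Checks on the last three problems. With intercepts, Cov(W_1,W_2) is still
% sigma^2_x, so Corr(W_1,W_2) = sigma^2_x/(sigma^2_x + sigma^2_e), the
% reliability. For the gold standard problem,
%   Corr(W,G)^2 = sigma^4_x/( (sigma^2_x + Var(e_1))(sigma^2_x + Var(e_2)) ),
% which equals the reliability of W times sigma^2_x/(sigma^2_x + Var(e_2)),
% a factor strictly less than one. With correlated errors,
%   Corr(W_1,W_2) = (sigma^2_x + c)/(sigma^2_x + sigma^2_e),
% strictly greater than the reliability because c > 0.

\end{document}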