\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
%\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\oddsidemargin=0in % Good for US Letter paper
\evensidemargin=0in
\textwidth=6.5in
\topmargin=-0.8in
\headheight=0in
\headsep=0.5in
\textheight=9.4in

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442 Assignment 7}}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. Except for \texttt{bp.data.txt} (which may belong to SPSS), it is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf17} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf17}}}
\vspace{1 mm}
\end{center}

The non-computer questions on this assignment are practice for the quiz on Friday October 27th, and are not to be handed in. Please do the problems using the formula sheet as necessary. A copy of the formula sheet will be distributed with the quiz. As usual, you may use anything on the formula sheet unless you are directly asked to prove it.

% MVN LR
% Logistic regression
% Maybe Poisson regression.

\begin{enumerate}
\item Looking at the expression for the multivariate normal likelihood on the formula sheet, how can you tell that for \emph{any} fixed positive definite $\boldsymbol{\Sigma}$, the likelihood is greatest when $\boldsymbol{\mu} = \overline{\mathbf{y}}$?
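Hint, as a sketch (this assumes the formula sheet writes the multivariate normal likelihood in the standard form below; if your copy's notation differs, the same reasoning applies wherever $\boldsymbol{\mu}$ appears):

```latex
\begin{displaymath}
L(\boldsymbol{\mu},\boldsymbol{\Sigma}) =
|\boldsymbol{\Sigma}|^{-n/2} (2\pi)^{-np/2}
\exp\left\{-\frac{n}{2}\left[
tr(\boldsymbol{\widehat{\Sigma}\Sigma}^{-1}) +
(\overline{\mathbf{y}}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}
(\overline{\mathbf{y}}-\boldsymbol{\mu})
\right]\right\}
\end{displaymath}
```

The parameter $\boldsymbol{\mu}$ appears only in the quadratic form, and because $\boldsymbol{\Sigma}^{-1}$ is positive definite, that quadratic form is non-negative and equals zero exactly when $\boldsymbol{\mu} = \overline{\mathbf{y}}$.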
\item \label{ind} Based on a random sample of size $n$ from a $p$-dimensional multivariate normal distribution, derive a formula for the large-sample likelihood ratio test statistic $G^2$ for the null hypothesis that $\boldsymbol{\Sigma}$ is diagonal (all covariances between variables are zero). You may use the likelihood function on the formula sheet. You may also use without proof the fact that the unrestricted MLE is $\widehat{\theta}=(\overline{\mathbf{y}},\boldsymbol{\widehat{\Sigma}})$.

Hint: Because zero covariance implies independence for the multivariate normal, the joint density is a product of marginals under $H_0$. To be direct, I am suggesting that you \emph{not} use the likelihood function on the formula sheet to calculate the numerator of the likelihood ratio. You'll eventually get the right answer if you insist on doing it that way, but it's a lot more work.
% From 431s09 Assignment 4, which has a lot more detail.

% \newpage
\item \label{heart} The file
\href{http://www.utstat.toronto.edu/~brunner/data/illegal/bp.data.txt} {\texttt{http://www.utstat.toronto.edu/$\sim$brunner/data/illegal/bp.data.txt}}
has diastolic blood pressure, education, cholesterol, number of cigarettes per day and weight in pounds for a sample of middle-aged men. There are missing values; \texttt{summary} will tell you what they are. Assuming multivariate normality and using R, carry out a large-sample likelihood ratio test to determine whether there are any non-zero covariances among the five variables; guided by the usual $\alpha=0.05$ significance level, what do you conclude? Are the five variables all independent of one another? Answer Yes or No.

For this question, let's agree that we will base the sample covariance matrix only on \emph{complete observations}. That is, there will be no missing values on any variable. Don't forget that $\boldsymbol{\widehat{\Sigma}}$, like $\widehat{\sigma}_j^2$, has $n$ in the denominator and not $n-1$. What is $n$?
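Problem~\ref{heart} asks for the computation in R; purely as an illustration, here is a sketch in Python of the statistic that the derivation in Problem~\ref{ind} leads to, namely $G^2 = n\left(\sum_{j=1}^p \log\widehat{\sigma}_j^2 - \log|\boldsymbol{\widehat{\Sigma}}|\right)$ on $p(p-1)/2$ degrees of freedom. Treat that form as an assumption to check against your own derivation; the data matrix below is synthetic, standing in for the complete cases of \texttt{bp.data.txt}.

```python
import numpy as np

def lrt_diagonal(X):
    """Large-sample LRT of H0: all covariances are zero, treating the
    rows of X as a sample from a multivariate normal distribution.
    Returns (G2, df).  Sigma-hat is the MLE, with n (not n - 1) in the
    denominator."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Sigma_hat = np.cov(X, rowvar=False, bias=True)  # bias=True -> divide by n
    # G2 = n * ( sum_j log(sigma-hat_j^2) - log |Sigma-hat| )
    sign, logdet = np.linalg.slogdet(Sigma_hat)
    G2 = n * (np.sum(np.log(np.diag(Sigma_hat))) - logdet)
    df = p * (p - 1) // 2    # one restriction per covariance set to zero
    return G2, df

# Synthetic stand-in for the complete observations (5 variables):
rng = np.random.default_rng(2101)
X = rng.normal(size=(200, 5))
G2, df = lrt_diagonal(X)     # compare G2 to the chi-squared(df) upper tail
```

In R the $p$-value would then be \texttt{1 - pchisq(G2, df)}. By Hadamard's inequality, $|\boldsymbol{\widehat{\Sigma}}| \le \prod_j \widehat{\sigma}_j^2$, so $G^2 \ge 0$.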
\emph{Bring your printout to the quiz}.

\item Here is a useful variation on Problem~\ref{ind}. Suppose $n$ independent and identically distributed data vectors $\mathbf{d}_i = \left(\begin{array}{c} \mathbf{x}_i \\ \hline \mathbf{y}_i \end{array} \right)$ are multivariate normal. The notation means that $\mathbf{d}_i$ is $\mathbf{x}_i$ stacked on top of $\mathbf{y}_i$. For example, $\mathbf{x}_i$ could be physical measurements and $\mathbf{y}_i$ could be psychological measurements. Derive a likelihood ratio test to determine whether $\mathbf{x}_i$ and $\mathbf{y}_i$ are independent. Your answer is a formula for $G^2$ and a formula for the degrees of freedom. Part of the job here is to make up good, simple notation.

\pagebreak
\item Dead pixels are a big problem in manufacturing computer and cell phone screens. The physics of the manufacturing process dictates that dead pixels happen according to a spatial Poisson process, so that the numbers of dead pixels in cell phone screens are independent Poisson random variables with parameter $\lambda$, the expected number of dead pixels. Naturally, $\lambda$ depends on details of how the screens are manufactured. In an effort to reduce the expected number of dead pixels, six assembly lines were set up, each with a different version of the manufacturing process. A random sample of 50 phones was taken from each assembly line and sent to the lab for testing. Mysteriously, three phones from one assembly line disappeared in transit, and 15 phones from another assembly line disappeared. Sample sizes and sample mean numbers of dead pixels appear in the table below.
\begin{verbatim}
                Manufacturing Process
           1        2      3     4      5        6
      -----------------------------------------
ybar   10.68  9.87234   9.56  8.52  10.48571  9.98
n         50       47     50    50        35    50
      -----------------------------------------
\end{verbatim}
\begin{enumerate}
\item The first task is to carry out a large sample likelihood ratio test to see whether the expected numbers of dead pixels are different for the six manufacturing processes. Using R, calculate the test statistic and the $p$-value. Also report the degrees of freedom. You are being asked for a computation, but \emph{most of the task is thinking and working things out on paper}. I got away with only five lines of code: One line to enter the means, one line to enter the sample sizes, one line to compute $G^2$, one line to compute the $p$-value, and one other line.

Here are some little questions to get you started.
\begin{enumerate}
\item Is this a between-cases design or a within-cases design?
\item Denote the parameter vector by $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_p)^\top$. What is $p$?
\item What is the null hypothesis?
\item What is the distribution of a sum of independent Poisson random variables?
\item What is the distribution of $n_j\overline{Y}_j$?
\item What is the likelihood function? Write it down and simplify.
\item What is the unrestricted MLE $\widehat{\boldsymbol{\lambda}}$? It's a vector. Work it out if you need to.
\item What is the restricted MLE $\widehat{\boldsymbol{\lambda}}_0$? It's a vector. Work it out if you need to.
\item Now you are ready to write the test statistic. There are a lot of cancellations. Keep simplifying!
\item Now use R to compute the test statistic and $p$-value. For comparison, my $p$-value is $0.01169133$. \emph{Bring your printout to the quiz}.
\end{enumerate}
\item Clearly we need to follow up this result to see where it came from, but we'll do Wald tests because they are a little easier.
As preparation (and to get some exercise), carry out a Wald test of the overall null hypothesis you just tested above. You'll need an estimated asymptotic covariance matrix of $\widehat{\boldsymbol{\lambda}}$, and in this case expressing it in closed form is easier than obtaining and inverting the Hessian. So please do it the easy way. The questions below serve as a guide.
\begin{enumerate}
\item The asymptotic variance of $\overline{Y}_j$ is just its variance. What is $Var(\overline{Y}_j)$?
\item The \emph{estimated} asymptotic variance of $\overline{Y}_j$ is the most natural thing you can imagine. What is it?
\item So what's the estimated asymptotic variance-covariance matrix of the random vector $\widehat{\boldsymbol{\lambda}}$?
\end{enumerate}
Now you can carry out the Wald test. Do it with R, obtaining the test statistic, the degrees of freedom and the $p$-value. \emph{Bring your printout to the quiz}.

\item Finally, carry out all pairwise comparisons with a Bonferroni correction, protecting the entire family of tests against Type I error at the \emph{joint} $\alpha=0.05$ significance level. In plain language, what do you conclude? Remember, with pairwise comparisons you can always draw directional conclusions. \emph{Bring your printout to the quiz}.
\end{enumerate}

%%%%%%%%%%%%%%%%%%%%%%%%%%%% Logistic regression %%%%%%%%%%%%%%%%%%%%%%%%%%%%

\item If two events have equal probability, the odds ratio equals \underline{~~~~~~}.

\item For a multiple logistic regression model, if the value of the $k$th explanatory variable is increased by $c$ units and everything else remains the same, the odds of $Y=1$ are \underline{~~~~~~} times as great. Prove your answer.

\item For a multiple logistic regression model, let $P(Y_i=1| x_{i,1}, \ldots, x_{i,p-1}) = \pi(\mathbf{x}_i)$.
Show that a linear model for the log odds is equivalent to
\begin{displaymath}
\pi(\mathbf{x}_i) = \frac{e^{\beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{p-1} x_{i,p-1}}}
                        {1+e^{\beta_0 + \beta_1 x_{i,1} + \ldots + \beta_{p-1} x_{i,p-1}}}
                  = \frac{e^{\mathbf{x}_i^\top\boldsymbol{\beta}}}
                        {1+e^{\mathbf{x}_i^\top\boldsymbol{\beta}}}
\end{displaymath}

\item Write the log likelihood for a general logistic regression model, and simplify it as much as possible. Of course use the result of the last question.

% \item \label{delta} In the Logistic regression with R slide show, I reproduced
% the standard error for an estimated probability of passing the course.
% How did I do it? Show your work.
% Not in 2014

\item A logistic regression model with no explanatory variables has just one parameter, $\beta_0$. It also has the same probability $\pi = P(Y=1)$ for each case.
\begin{enumerate}
\item Write $\pi$ as a function of $\beta_0$; show your work.
\item The \emph{invariance principle} of maximum likelihood estimation says the MLE of a function of the parameter is that function of the MLE. It is very handy. Now, still considering a logistic regression model with no explanatory variables,
\begin{enumerate}
\item Suppose $\overline{y}$ (the sample proportion of $Y=1$ cases) is 0.57. What is $\widehat{\beta}_0$? Your answer is a number. % 0.2818512
\item Suppose $\widehat{\beta}_0=-0.79$. What is $\overline{y}$? Your answer is a number. % 0.3121687
\end{enumerate}
\end{enumerate}

\newpage
\item Consider a logistic regression in which the cases are newly married couples with both people from the same religion. The explanatory variables are total family income and religion. Religion is coded A, B, C and None (let's call ``None'' a religion), and the response variable is whether the marriage lasted 5 years (1=Yes, 0=No).
\begin{enumerate}
\item Write a linear model for the log odds of a successful\footnote{I agree, this may be a modest definition of success.} marriage.
You do not have to say how the dummy variables are defined. You will do that in the next part.
\item Make a table with four rows, showing how you would set up indicator dummy variables for Religion, with None as the reference category.
\item Add a column showing the odds of the marriage lasting 5 years. The \emph{symbols} for your dummy variables should not appear in your answer, because they are zeros and ones, and different for each row. But of course your answer contains $\beta$ values. Denote income by $x$.
\item For a constant value of income, what is the ratio of the odds of a marriage lasting 5 years or more for Religion C to the odds of lasting 5 years or more for No Religion? Answer in terms of the $\beta$ symbols of your model.
\item Holding income constant, what is the ratio of the odds of lasting 5 years or more for Religion A to the odds of lasting 5 years or more for Religion B? Answer in terms of the $\beta$ symbols of your model.
\item You want to test whether, controlling for income, Religion is related to whether the marriage lasts 5 years. State the null hypothesis in terms of one or more $\beta$ values.
\item You want to know whether marriages from Religion A are more likely to last 5 years than marriages from Religion C, allowing for income. State the null hypothesis in terms of one or more $\beta$ values.
\item You want to test whether marriages between people of No Religion with an average income have a 50-50 chance of lasting 5 years. State the null hypothesis in symbols. To hold income to an ``average'' value, just set $x=\overline{x}$.
\end{enumerate}
\end{enumerate}

\end{document}

% Zipper identifiability
\item \label{poisson} Men and women are calling a technical support line according to independent Poisson processes with rates $\lambda_1$ and $\lambda_2$ per hour. Data for 144 hours are available, but unfortunately the sex of the caller was not recorded.
All we have is the number of callers for each hour, which is distributed Poisson($\lambda_1+\lambda_2$). Here are the data, which are also available in the file
\href{http://www.utstat.toronto.edu/~brunner/data/legal/poisson.data.txt} {\texttt{http://www.utstat.toronto.edu/$\sim$brunner/data/legal/poisson.data.txt}}.
\begin{verbatim}
12  9 25 11 22 16 17  8 14 17 14 17
 8 14 18 16 13 17 13 11  9 15 18 14
16 17 16 19 13 18 12 12 13 10  8 15
13 11 15 15  6 10 13 13 11 13  9 15
16  9  5 10  8 18 13 17  7 13 13 17
12 17 14 16  6 12 17 10  9 14 11 19
13 17 15 20 14 10 13 14 17  9 13 14
 7 16 16  9 25 10 10  9 17  7 15 12
14 21 14 18 14 12 13 15 12 11 16 14
15 16  8 19 13 17 15 11 18 13 12 11
19 14 16 17 13 13 19 19 11 19 10 12
 9 18 11 14  9 14 14 14 13  9 13 18
\end{verbatim}
\begin{enumerate}
\item The parameter in this problem is $\boldsymbol{\theta} = (\lambda_1,\lambda_2)^\top$. Try to find the MLE analytically. Show your work. Are there any points in the parameter space where both partial derivatives are zero?
\item Now try to find the MLE numerically with R's \texttt{nlm} function\footnote{When I did this, I got lots of warning messages with some starting values, when the search repeatedly left the parameter space and then bounced back in. For this problem, you don't need to worry about the warnings as long as the exit code is one.}. The Hessian is interesting; ask for it. Try two different starting values. Compare the minus log likelihoods at your two answers. What seems to be happening here?
\item Try inverting the Hessian to get the asymptotic covariance matrix. Any comments?
\item To understand what happened in the last item, calculate the Fisher information in a single observation from the definition.
That is, letting $\ell = \log f(Y;\boldsymbol{\theta})$, calculate the elements of the $2 \times 2$ matrix whose $(i,j)$ element is
\begin{displaymath}
-E\left(\frac{\partial^2 \ell}{\partial\theta_i\partial\theta_j}\right)
\end{displaymath}
\item The Fisher information in the sample is just $n$ times the Fisher information in a single observation. Using the numerical MLEs from one of your \texttt{nlm} runs, estimate this quantity (a $2 \times 2$ matrix). Compare it to the Hessian. Now do you see what happened when you tried to calculate the asymptotic covariance matrix?
\item Why did estimation fail for this fairly realistic model?
\end{enumerate}

Bring your printout to the quiz. Your printout must \emph{not} contain answers to the non-computer parts of this question. That is, it must contain only numerical answers.

% 2014 A9 has birdlung. Birthweight is good too, and may be an R dataset.
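A sketch of where the Fisher information calculation lands (work this out yourself rather than taking it on faith; the assignment asks for R, and Python is used here only for illustration): with $\ell = y\log(\lambda_1+\lambda_2) - (\lambda_1+\lambda_2) - \log y!$, every second partial derivative is $-Y/(\lambda_1+\lambda_2)^2$, and $E(Y)=\lambda_1+\lambda_2$, so every entry of the single-observation Fisher information equals $1/(\lambda_1+\lambda_2)$.

```python
import numpy as np

# Fisher information in one observation Y ~ Poisson(lambda1 + lambda2).
# Every second partial of the log density is -y / (lambda1 + lambda2)^2,
# and E(Y) = lambda1 + lambda2, so every entry of -E(Hessian) equals
# 1 / (lambda1 + lambda2).
def fisher_info_single(lambda1, lambda2):
    c = 1.0 / (lambda1 + lambda2)
    return np.array([[c, c], [c, c]])

info = fisher_info_single(7.0, 7.0)   # lambda values here are arbitrary
rank = np.linalg.matrix_rank(info)    # rank 1: the matrix is singular, so
                                      # it cannot be inverted to produce an
                                      # asymptotic covariance matrix
```

Only the sum $\lambda_1+\lambda_2$ is identifiable from these data, which is why the singularity appears for every point in the parameter space.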