\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue,
citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442 Assignment Four}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent Please bring your R printouts for questions \ref{numle} and \ref{lrtest} to the quiz. The other questions are just practice for the quiz, and are not to be handed in, though you may use R as a calculator. Bring a real calculator to the quiz.

\begin{enumerate}

%%%%%%%%%% Multinomial %%%%%%%%%%%%

\item Ten friends have a party right after graduating from university. At the time, none of them has ever been married. The party includes a visit by a fortune teller, who says ``Five years from now, 3 of you will still be unmarried, 3 of you will be married for the first time, 2 will be divorced, one will be married for the second time, and one will be widowed.'' How many ways are there for this to happen? The answer is a number. Show your work.

\item A fair die is tossed 8 times. What is the probability of observing the numbers 3 and 4 twice each, and the others once each? The answer is a number.
% about 0.006, stolen from Schaum's outline with slight changes.

\item A box contains 5 red, 3 white and 2 blue marbles. A sample of six marbles is drawn with replacement. Find the probability that
\begin{enumerate}
\item 3 are red, 2 are white and 1 is blue % 0.135
\item 2 are red, 3 are white and 1 is blue % 0.0810
\item 2 of each colour appear. % 0.0810
\end{enumerate}
All the answers are numbers.
% Stolen from Schaum's outline without modification except for the spelling of colour.
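The three questions above are all multinomial calculations, and R can serve as the calculator. A hedged check (not part of the required hand-in; \texttt{dmultinom} is R's built-in multinomial probability function):

```r
# Fortune teller: number of ways to split 10 people into groups of
# sizes 3, 3, 2, 1, 1 is the multinomial coefficient 10!/(3! 3! 2! 1! 1!).
ways <- factorial(10) / (factorial(3) * factorial(3) * factorial(2))
ways  # 50400

# Fair die, 8 tosses: 3 and 4 twice each, the other four faces once each.
dmultinom(c(1, 1, 2, 2, 1, 1), prob = rep(1/6, 6))  # about 0.006

# Marbles with replacement: P(red) = .5, P(white) = .3, P(blue) = .2.
dmultinom(c(3, 2, 1), prob = c(.5, .3, .2))  # (a) 0.135
dmultinom(c(2, 3, 1), prob = c(.5, .3, .2))  # (b) 0.081
dmultinom(c(2, 2, 2), prob = c(.5, .3, .2))  # (c) 0.081
```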
\item Let $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ be a random sample from a $M\left(1,(\theta_1, \ldots ,\theta_c)\right)$ distribution. Show why the likelihood function is written $L(\boldsymbol{\theta}) = \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_c^{n_c}$.

%%%%%%%%%% Multivariate Delta Method %%%%%%%%%%%%

\item Let $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ be a random sample from a $M\left(1,(\theta_1,\theta_2,\theta_3)\right)$ distribution. Find the maximum likelihood estimator of $(\theta_1,\theta_2,\theta_3)$. Show \emph{all} your work.

\item Let $X_1, \ldots, X_n$ be a random sample from a $N(\mu_1,\sigma^2_1)$ distribution, and $Y_1, \ldots, Y_n$ (same $n$) be a random sample from a $N(\mu_2,\sigma^2_2)$ distribution, independent of the first one. Find the asymptotic distribution of $\frac{\overline{X}_n}{\overline{Y}_n}$. What happens when $\mu_2=0$?

%\newpage
\item In a political poll, a random sample of registered voters indicate which party they generally like most: Conservative, NDP or Liberal (other preferences were indicated by a small number of respondents; they are excluded from this analysis). A multinomial model seems reasonable for these data, with $n_1$, $n_2$ and $n_3$ denoting the number who chose Conservative, NDP and Liberal respectively. Of course $n=n_1+n_2+n_3$. The \emph{odds} of an event is the probability of the event divided by one minus the probability. Take the natural log and you have the \emph{log odds}, a quantity that has a prominent role in categorical data analysis.
\begin{enumerate}
\item Give consistent estimators of the log odds of supporting the Conservatives and the log odds of supporting the NDP. How do you know the estimators are consistent?
\item Find the approximate large-sample \emph{joint} distribution of the two log odds estimators. Show your work. The covariance matrix has a fairly nice form.
\item Express your answer to the last part by saying ``They're approximately bivariate normal (what else?)
with expected value \dots''

\item Suppose that in a random sample of 200 voters, 91 chose the Conservatives, 71 the NDP and 38 the Liberals. Give
\begin{enumerate}
\item A point estimate of the odds (not log odds) of choosing the NDP. The answer is one number.
\item A 95\% confidence interval for the log odds of choosing the NDP. The answer is a pair of numbers.
\item Using your answer to the last part (the accepted way to do it), a 95\% confidence interval for the odds of choosing the NDP. The answer is a pair of numbers.
\end{enumerate}
\end{enumerate}

%%%%%%%%%% Likelihood 1 %%%%%%%%%%%%

\item \label{numle} For each of the following distributions and associated data sets, obtain the maximum likelihood estimate numerically with \texttt{R}. Bring your printout for each problem to the quiz; you may be asked to hand it in. There are links to the data from the course web page in case the ones from this document do not work.
\begin{enumerate}
\item $f(x) = \frac{1}{\pi[1+(x-\theta)^2]}$ for $x$ real, where $-\infty < \theta < \infty$. Data:
% 50% mixture of Cauchy(-5) and Cauchy(-5): Two local maxima
% +4.263357 and -3.719397, global at latter. Signs of data switched from 2011,
% which should make it more interesting.
\begin{verbatim}
 -3.77  -3.57   4.10   4.87  -4.18  -4.59  -5.27  -8.33   5.55  -4.35
 -0.55   5.57 -34.78   5.05   2.18   4.12  -3.24   3.78  -3.57   4.86
\end{verbatim}
You might want to read the data from \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/cauchy.data}{\texttt{cauchy.data}}. For this one, try at least two different starting values and \emph{plot the minus log likelihood function!}
%
\item \label{beta} $f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}$ for $0<x<1$, where $\alpha>0$ and $\beta>0$.
Data:
% Beta(10,20) Warn bounds
\begin{verbatim}
0.45 0.42 0.38 0.26 0.43 0.24 0.32 0.50 0.44 0.29
0.45 0.29 0.29 0.32 0.30 0.32 0.30 0.38 0.43 0.35
0.32 0.33 0.29 0.20 0.46 0.31 0.35 0.27 0.29 0.46
0.43 0.37 0.32 0.28 0.20 0.26 0.39 0.35 0.35 0.24
0.36 0.28 0.32 0.23 0.25 0.43 0.30 0.43 0.33 0.37
\end{verbatim}
You might want to read the data from \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/beta.data}{\texttt{beta.data}}. If you are getting a lot of warnings, maybe it's because the numerical search is leaving the parameter space. If so, try \texttt{help(nlminb)}.

\newpage
\item $f(x) = \frac{\theta e^{\theta(x-\mu)}}{(1+e^{\theta(x-\mu)})^2}$ for $x$ real, where $-\infty < \mu < \infty$ and $\theta > 0$. Data:
\begin{verbatim}
 4.82  3.66  4.39  1.66  3.80  4.69  1.73  4.50  9.29  4.05
 4.50 -0.64  1.40  4.18  2.70  5.65  5.47  0.55  4.64  1.19
 2.28  7.16  4.80  3.19  2.33  2.57  2.31  0.35  2.81  2.35
 2.52  3.44  2.71 -1.43  7.61  0.93  2.52  6.86  6.14  4.37
 3.79  5.04  4.50  1.92  3.25 -0.06  2.81  3.09  2.95  3.69
\end{verbatim}
% Logistic mu = 3, theta=1, median 3.22, mle =
% $estimate
% [1] 3.3392004 0.8760226
You might want to read the data from \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/mystery.data}{\texttt{mystery.data}}.

\vspace{2mm}
\item $f(x) = \frac{1}{m!} e^{-x} x^m$ for $x>0$, where the unknown parameter $m$ is a positive integer. \emph{This means your estimate will be an integer.} Data:
% Gamma m=5 so alpha=6,beta=1
\begin{verbatim}
 8.34  7.65  6.72  3.84  7.12  1.88  5.07  2.69  4.50  5.78
 4.88  5.23  6.17 11.76  7.84  5.87  5.23  6.55  8.34  5.35
 4.98 13.81  8.62  7.88  6.34  5.16  6.64  4.35  6.77  5.83
 5.85  2.46  8.33  3.74  5.10  3.95  7.84  4.70  6.09  5.23
 1.44  6.11  4.88  7.24  7.89  8.98  1.78  5.46  5.34  4.25
\end{verbatim}
You might want to read the data from \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/gamma.data}{\texttt{gamma.data}}.
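Since $m$ is restricted to the positive integers, \texttt{nlm} is not appropriate here; one option is to evaluate the minus log likelihood on a grid of integers. A minimal sketch (the data vector is copied from above; the grid limit of 20 is an arbitrary choice of mine, not part of the problem):

```r
# Data for the integer-parameter gamma problem. Since
# f(x) = e^(-x) x^m / m!, the minus log likelihood is
# n log(m!) + sum(x) - m sum(log(x)).
x <- c( 8.34,  7.65,  6.72,  3.84,  7.12,  1.88,  5.07,  2.69,  4.50,  5.78,
        4.88,  5.23,  6.17, 11.76,  7.84,  5.87,  5.23,  6.55,  8.34,  5.35,
        4.98, 13.81,  8.62,  7.88,  6.34,  5.16,  6.64,  4.35,  6.77,  5.83,
        5.85,  2.46,  8.33,  3.74,  5.10,  3.95,  7.84,  4.70,  6.09,  5.23,
        1.44,  6.11,  4.88,  7.24,  7.89,  8.98,  1.78,  5.46,  5.34,  4.25)
mll <- function(m) length(x) * lfactorial(m) + sum(x) - m * sum(log(x))
mgrid <- 1:20                          # arbitrary upper limit for the search
mhat  <- mgrid[which.min(sapply(mgrid, mll))]
mhat                                   # the integer MLE
```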
\end{enumerate}
For each distribution, be able to state (briefly) why differentiating the log likelihood and setting the derivative to zero does not work. For the computer part, bring to the quiz one sheet of printed output for each distribution. The sheets should be separate, because you may hand only one of them in. Each printed page should show the following, \emph{in this order}.
\begin{itemize}
\item Definition of the function that computes the likelihood, or log likelihood, or minus log likelihood or whatever.
\item How you got the data into R -- probably a \texttt{scan} statement.
\item Listing of the data for the problem.
\item The \texttt{nlm} or \texttt{nlminb} statement and resulting output.
\item For the Cauchy example, a plot of the minus log likelihood.
\end{itemize}

\item \label{lrtest} For the data of Problem~\ref{beta}, conduct a large-sample likelihood ratio test of $H_0:\alpha=\beta$, using \texttt{R}. Your printout should display the value of $G^2$, the degrees of freedom and the $p$-value. Do you reject $H_0$ at the $0.05$ significance level? If yes, which parameter seems to be larger, based on $\widehat{\alpha}$ and $\widehat{\beta}$?

\end{enumerate}

%\vspace{30mm}
\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline \end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf12}{\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf12}}

\end{document}

q8c:
# Working on the mystery data. My records indicate true
# mu = 3, theta=1.
# Sample median = 3.22
datta = scan("http://www.utstat.toronto.edu/%7Ebrunner/appliedf12/data/mystery.data")

# Minus LL for mystery (logistic) distribution of Q8c, HW4
mll = function(theta,x)
    {
    mu = theta[1]; T = theta[2]
    n = length(x); xbar = mean(x)
    mll = - n * (log(T) + T*(xbar-mu)) + 2 * sum(log(1+exp(T*(x-mu))))
    mll
    } # End function mll

# Density is symmetric about mu, so xbar is a good start for mu
xbar = mean(datta); xbar
search1 = nlm(mll,c(xbar,1),hessian=T,x=datta); search1
eigen(search1$hessian)$values

# Try starting at the end point of search1
search2 = nlm(mll,search1$estimate,hessian=T,x=datta); search2 #$

# Start someplace far from the truth
search3 = nlm(mll,c(7,7),hessian=T,x=datta); search3

% Maybe later
\item \label{coffee} A fast food chain is considering a change in the blend of coffee beans they use to make their coffee. To determine whether their customers prefer the new blend, the company selects a random sample of coffee-drinking customers and asks them to taste coffee made with the old blend and two new alternatives, in cups marked ``A,'' ``B'' and ``C.'' Labels are in a different random order for each customer.
\begin{enumerate}
\item Derive a general formula for the (unrestricted) maximum likelihood estimator (MLE) of $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3)$. Show your work. In case you didn't realize it, this is a multinomial.
\item Suppose $n=150$ consumers participate in the taste test. Thirty-eight prefer the old blend, 55 prefer new alternative One, and 57 prefer new alternative Two. Carry out a large-sample likelihood ratio test to determine whether there is a real difference in preference for the blends of coffee; use $\alpha=0.05$. Use \texttt{R}. Your printout should display the value of $G^2$, the degrees of freedom and the $p$-value. Do you reject $H_0$ at the $0.05$ significance level?
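Here the restricted model puts $\theta_1=\theta_2=\theta_3=\frac{1}{3}$, so the likelihood ratio statistic reduces to $G^2 = 2\sum_{j=1}^3 n_j \log\left(\widehat{\theta}_j / \frac{1}{3}\right)$, with 2 degrees of freedom. A minimal R sketch (variable names are my own, not a required format):

```r
# Large-sample likelihood ratio test of H0: theta1 = theta2 = theta3 = 1/3
n    <- c(38, 55, 57)        # old blend, new One, new Two; total is 150
phat <- n / sum(n)           # unrestricted MLE of (theta1, theta2, theta3)
p0   <- rep(1/3, 3)          # restricted MLE under H0
G2   <- 2 * sum(n * log(phat / p0))
pval <- 1 - pchisq(G2, df = 2)
c(G2 = G2, df = 2, p = pval)
```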
\end{enumerate}

%%%%%

\item \label{mvn} Based on a random sample of size $n$ from a $p$-dimensional multivariate normal distribution, derive a formula for the large-sample likelihood ratio test statistic $G^2$ for the null hypothesis that $\boldsymbol{\Sigma}$ is diagonal (all covariances between variables are zero). You may use material on the slides from the multivariate normal lecture (including a useful version of the likelihood function), which will be provided with the quiz if this question is asked. You may also use the fact that the unrestricted MLE is $\widehat{\theta}=(\overline{\mathbf{x}},\boldsymbol{\widehat{\Sigma}})$, where
\begin{displaymath}
\boldsymbol{\widehat{\Sigma}} = \frac{1}{n}\sum_{i=1}^n
(\mathbf{x}_i-\overline{\mathbf{x}})
(\mathbf{x}_i-\overline{\mathbf{x}})^\prime .
\end{displaymath}
Hint: Because zero covariance implies independence for the multivariate normal, the joint density is a product of marginals under $H_0$.
% From 431s09 Assignment 4, which has a lot more detail.

\item In the United States, admission to university is based partly on high school marks and recommendations, and partly on applicants' performance on a standardized multiple choice test called the Scholastic Aptitude Test (SAT). The SAT has two sub-tests, Verbal and Math. A university administrator selected a random sample of 200 applicants, and recorded the Verbal SAT, the Math SAT and first-year university Grade Point Average (GPA) for each student. The data are given in the file \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/sat.data}{\texttt{sat.data}}. There is a link on the course web page in case the one in this document does not work. Using R, carry out a large-sample likelihood ratio test to determine whether there are any non-zero covariances among the three variables; use $\alpha=0.05$.
\begin{enumerate}
\item Calculate the test statistic $G^2$ and the $p$-value; also give the degrees of freedom. Your answer is a set of 3 numbers.
\item Do you reject $H_0$? Answer Yes or No.
\item Are the three variables all independent of one another? Answer Yes or No.
\end{enumerate}

\item The table below shows the joint probability distribution of the two binary random variables $X$ and $Y$, with $P(X=i,Y=j)=\theta_{ij}$.
\begin{center}
\begin{tabular}{|l|c|c|} \hline
        & $Y=1$         & $Y=2$         \\ \hline
$X=1$   & $\theta_{11}$ & $\theta_{12}$ \\ \hline
$X=2$   & $\theta_{21}$ & $\theta_{22}$ \\ \hline
\end{tabular}
\end{center}
A random sample yields the following observed frequencies. This is a \emph{contingency table}.
\begin{center}
\begin{tabular}{|l|c|c|} \hline
        & $Y=1$    & $Y=2$    \\ \hline
$X=1$   & $n_{11}$ & $n_{12}$ \\ \hline
$X=2$   & $n_{21}$ & $n_{22}$ \\ \hline
\end{tabular}
\end{center}
This means that $X=1$ and $Y=1$ for $n_{11}$ of the $n$ individuals in the sample, where $n = n_{11}+ n_{12}+ n_{21}+ n_{22}$.