\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442 Assignment Four}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
Please bring your R printouts for Questions~\ref{SATmean}, \ref{SATvar} and~\ref{boot} to the quiz. Please print them separately, because you may be asked to hand in just one of them. The other questions are just practice for the quiz, and are not to be handed in, though you may use R as a calculator. Bring a real calculator to the quiz.

\begin{enumerate}
%%%%%%%%%% Wald-like Test %%%%%%%%%%

\item The statistic
\begin{displaymath}
W_n = n \left( \mathbf{LT}_n - \mathbf{h} \right)^\prime
\left(\mathbf{L}\widehat{\boldsymbol{\Sigma}}_n \mathbf{L}^\prime \right)^{-1}
\left(\mathbf{LT}_n - \mathbf{h} \right)
\end{displaymath}
provides a large-sample test of $H_0:\mathbf{L}\boldsymbol{\theta}=\mathbf{h}$. Let $X_1, \ldots, X_n$ be a random sample from a $B(1,\theta)$ distribution.
    \begin{enumerate}
    \item Write down and simplify the $W_n$ statistic for testing $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$.
    \item Your answer is related to one of the test statistics $Z_1$ and $Z_2$ introduced earlier for testing this same null hypothesis. To which one is $W_n$ related, and how is it related?
    % W_n = Z_2^2
    \end{enumerate}

\item \label{SATmean} In Assignment One, you tested the difference between the mean Verbal SAT score and the mean Math SAT score. A link to the SAT data is available from the course home page. Using R, please calculate the $W_n$ statistic to test this hypothesis.
Feel free to use my code directly. Note that the statistic $\mathbf{T}_n$ is of dimension \emph{two}. Guided by the usual $\alpha=0.05$ significance level, what do you conclude? Be able to state your conclusion in plain, non-statistical language. Bring your R printout to the quiz.

\item \label{SATvar} Suppose you want to test for differences between \emph{variances} of Verbal SAT and Math SAT, again with minimal assumptions about their joint distribution. The idea is to first convince yourself that the joint distribution of $\widehat{\sigma}^2_1$ and $\widehat{\sigma}^2_2$ is asymptotically bivariate normal. Then, if you can get a good estimate of their asymptotic covariance matrix, you can use the $W_n$ test. For simplicity, let $\widehat{\sigma}^2_1$ and $\widehat{\sigma}^2_2$ have $n$ in the denominator rather than $n-1$.
    \begin{enumerate}
    \item Write down a formula for one of the sample variances, say $\widehat{\sigma}^2_1$, in a form that shows how it's a continuous function of a collection of sample means.
    \item Letting $X_i$ denote performance on the Verbal SAT and $Y_i$ denote performance on the Math SAT, we want a data vector $\mathbf{D}_i$ to which we can apply the Central Limit Theorem. Then, we would write $\widehat{\sigma}^2_1 - \widehat{\sigma}^2_2 = g(\overline{\mathbf{D}}_n)$ and the delta method would establish asymptotic normality. Show you know what's going on by writing down the data vector $\mathbf{D}_i$.
    \item It's too much work to calculate the Jacobian and then estimate all the moments in the asymptotic covariance matrix, so we'll use the bootstrap instead. Let's all use a bootstrap sample size of $B=1{,}000$. Using R, calculate an estimated asymptotic covariance matrix.
    \item Using R, please calculate the $W_n$ statistic to test $H_0: \sigma^2_1=\sigma^2_2$.
    \item Guided by the usual $\alpha=0.05$ significance level, what do you conclude? Be able to state your conclusion in plain, non-statistical language.
    \end{enumerate}
Bring your R printout to the quiz.

\item A team of botanists grew fungus in a nutrient solution in test tubes. Each day for seven days, one of their graduate students carefully measured the length of the fungus in each of $n$ tubes. The scientists were interested in lots of things, including whether average growth was linear or not. Denote the expected amount of fungus at day $j$ by $\mu_j$.
    \begin{enumerate}
    \item What is the null hypothesis, in symbols?
    \item Assuming that the scientists wish to make as few assumptions as possible and $n$ is large, the $W_n$ statistic is natural for this problem. What is $\mathbf{T}_n$?
    \item What is $\mathbf{L}$?
    \item What is $\mathbf{h}$?
    \item What is a convenient choice for $\widehat{\boldsymbol{\Sigma}}_n$? How many rows and columns?
    \end{enumerate}

% \newpage
%%%%%%%%%% Multinomial %%%%%%%%%%%%

\item Ten friends have a party right after graduating from university. At the time, none of them has ever been married. The party includes a visit by a fortune teller, who says ``Five years from now, 3 of you will still be unmarried, 3 of you will be married for the first time, 2 will be divorced, one will be married for the second time, and one will be widowed.'' How many ways are there for this to happen? The answer is a number. Show your work.

\item A fair die is tossed 8 times. What is the probability of observing the numbers 3 and 4 twice each, and the others once each? The answer is a number.
% about 0.006, stolen from Schaum's outline with slight changes.

\item A box contains 5 red, 3 white and 2 blue marbles. A sample of six marbles is drawn with replacement. Find the probability that
    \begin{enumerate}
    \item 3 are red, 2 are white and 1 is blue. % 0.135
    \item 2 are red, 3 are white and 1 is blue. % 0.0810
    \item 2 of each colour appear. % 0.0810
    \end{enumerate}
All the answers are numbers.
% Stolen from Schaum's outline without modification except for the spelling of colour.

\item Let $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ be a random sample from a $M\left(1,(\theta_1, \ldots ,\theta_c)\right)$ distribution. Show why the likelihood function is written $L(\boldsymbol{\theta}) = \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_c^{n_c}$.

\item Let $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ be a random sample from a $M\left(1,(\theta_1,\theta_2,\theta_3)\right)$ distribution. Find the maximum likelihood estimator of $(\theta_1,\theta_2,\theta_3)$. Show \emph{all} your work.

%%%%%%%%%% Multivariate Delta Method %%%%%%%%%%%%

\item Let $X_1, \ldots, X_n$ be a random sample from an unknown distribution with expected value $\mu$ and variance one, and independently, let $Y_1, \ldots, Y_n$ be another random sample from the same distribution (with the same $n$). Find the asymptotic distribution of $\frac{\overline{X}_n}{\overline{Y}_n}$. What happens when $\mu=0$?

%\newpage
\item In a political poll, a random sample of registered voters indicate which party they generally like most: Conservative, NDP or Liberal (other preferences were indicated by a small number of respondents; they are excluded from this analysis). A multinomial model seems reasonable for these data, with $n_1$, $n_2$ and $n_3$ denoting the number who chose Conservative, NDP and Liberal respectively. Of course $n=n_1+n_2+n_3$. The \emph{odds} of an event is the probability of the event divided by one minus the probability. Take the natural log and you have the \emph{log odds}, a quantity that has a prominent role in categorical data analysis.
    \begin{enumerate}
    \item Give consistent estimators of the log odds of supporting the Conservatives and the log odds of supporting the NDP. How do you know the estimators are consistent?
    \item Find the approximate large sample \emph{joint} distribution of the two log odds estimators. Show your work. The covariance matrix has a fairly nice form.
    \item Express your answer to the last part by saying ``They're approximately bivariate normal (what else?)
    with expected value \dots''
    \item \label{voters} Suppose that in a random sample of 200 voters, 91 chose the Conservatives, 71 the NDP and 38 the Liberals. Give
        \begin{enumerate}
        \item The estimated asymptotic covariance matrix of the estimators. Your answer is a $2 \times 2$ matrix of numbers. Show your work.
        \item A point estimate of the odds (not log odds) of choosing the NDP. The answer is one number.
        \item A 95\% confidence interval for the log odds of choosing the NDP. The answer is a pair of numbers.
        \item Using your answer to the last part (the accepted way to do it), give a 95\% confidence interval for the odds (not log odds) of choosing the NDP. The answer is a pair of numbers.
        \end{enumerate}
    \item \label{boot} In Question~\ref{voters}, your estimated asymptotic covariance matrix was based on the delta method. Produce another estimated asymptotic covariance matrix using the bootstrap. Let's all use a bootstrap sample size of $B=1{,}000$. Bring your R printout to the quiz. I used the \texttt{rmultinom} function to do my bootstrap. See \texttt{help(rmultinom)} and think about it. Here's some output from two independent runs of my bootstrap. They are close to my delta method estimate. Please do \emph{not} use my crazy names \texttt{logoddscon} and \texttt{logoddsndp}.
\begin{verbatim}
           logoddscon  logoddsndp
logoddscon  0.02051701 -0.01420638
logoddsndp -0.01420638  0.02196846

           logoddscon  logoddsndp
logoddscon  0.02032482 -0.01431120
logoddsndp -0.01431120  0.02160154
\end{verbatim}
    \end{enumerate}

% \item Need a between, maybe next time with LR too!

\end{enumerate}

\vspace{140mm}

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}.
Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf13}
{\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf13}}

\end{document}

# Political bootstrap
ybar = c(91,71,38)/200; ybar
B = 1000; set.seed(32448)
D = t(rmultinom(B,200,c(91,71,38)))/200 # Matrix of bootstrapped Y-bar vectors
colnames(D) = c("Conservatives","NDP","Liberals")
logoddscon = log(D[,1]/(1-D[,1]))
logoddsndp = log(D[,2]/(1-D[,2]))
oddstar = cbind(logoddscon, logoddsndp)
v = var(oddstar); v

#### Two Tries ###

> B = 1000; set.seed(32448)
> D = t(rmultinom(B,200,c(91,71,38)))/200 # Matrix of bootstrapped Y-bar vectors
> colnames(D) = c("Conservatives","NDP","Liberals")
> logoddscon = log(D[,1]/(1-D[,1]))
> logoddsndp = log(D[,2]/(1-D[,2]))
> oddstar = cbind(logoddscon, logoddsndp)
> v = var(oddstar); v
           logoddscon  logoddsndp
logoddscon  0.02051701 -0.01420638
logoddsndp -0.01420638  0.02196846
>
> D = t(rmultinom(B,200,c(91,71,38)))/200 # Matrix of bootstrapped Y-bar vectors
> colnames(D) = c("Conservatives","NDP","Liberals")
> logoddscon = log(D[,1]/(1-D[,1]))
> logoddsndp = log(D[,2]/(1-D[,2]))
> oddstar = cbind(logoddscon, logoddsndp)
> v = var(oddstar); v
           logoddscon  logoddsndp
logoddscon  0.02032482 -0.01431120
logoddsndp -0.01431120  0.02160154

Compared to a delta method calculation of

           logoddscon logoddsndp
logoddscon     0.0202    -0.0142
logoddsndp    -0.0142     0.0218

# Old, wrong
B = 1000; set.seed(32448)
Dstar = t(rmultinom(B,200,c(91,71,38)))/200
colnames(Dstar) = c("Conservatives","NDP","Liberals")
v1 = var(Dstar[,1:2]); v1
# Again
Dstar2 = t(rmultinom(B,200,c(91,71,38)))/200
colnames(Dstar2) = c("Conservatives","NDP","Liberals")
v2 = var(Dstar2[,1:2]); v2
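
# Delta method check (sketch)
The delta method matrix quoted in these notes is given without the code that produced it. The following is a minimal sketch of one way to reproduce it, not necessarily the calculation used originally. It assumes the usual estimated multinomial covariance matrix of the sample proportions, (diag(theta) - theta theta')/n, and the Jacobian of the two log odds functions. The names n, theta, Sigma, J and V are illustrative and do not appear in the original code.

```r
# Delta method covariance matrix of the two log odds estimators (sketch).
n     <- 200
theta <- c(91, 71, 38) / n                    # sample proportions: Con, NDP, Lib

# Estimated covariance matrix of the vector of sample proportions (3 x 3)
Sigma <- (diag(theta) - theta %o% theta) / n

# Jacobian of g(theta) = (log odds Con, log odds NDP); 2 x 3.
# d/dx log(x/(1-x)) = 1/(x(1-x)); neither function depends on theta[3].
J <- cbind(diag(1 / (theta[1:2] * (1 - theta[1:2]))), 0)

V <- J %*% Sigma %*% t(J)                     # 2 x 2 asymptotic covariance matrix
dimnames(V) <- list(c("con", "ndp"), c("con", "ndp"))
round(V, 4)
```

Rounded to four decimal places, the entries are 0.0202 and 0.0218 on the diagonal and -0.0142 off the diagonal, which agrees with the matrix quoted above and is close to both bootstrap runs.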