\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 431s15 Assignment Six}}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/431s15} {\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/431s15}}}
\vspace{1 mm}
\end{center}

\noindent The non-computer questions on this assignment are for practice, and will not be handed in. For the SAS part of this assignment (Question~\ref{SASpig}) please bring your log file and your output file to the quiz. There may be one or more questions about them, and you may be asked to hand the printouts in with the quiz.

\begin{enumerate}
\item In the lecture notes, look at the matrix formulation of double measurement regression. As usual, expected values and intercepts are not identifiable, so confine your attention to the covariance matrix.
% In this question, watch out for Stage 1 and Stage 2. I intend to switch the order (2013). But I didn't: 2015, at least not in lecture.
\begin{enumerate}
\item How many unknown parameters appear in $\boldsymbol{\Sigma}$? The answer is an expression in $p$ and $q$.
\item How many unique variances and covariances are there in $\boldsymbol{\Phi} = V(\mathbf{F}_i)$? The answer is an expression in $p$ and $q$.
\item In total, how many unknown parameters are there in the Stage One matrices $\boldsymbol{\Phi}_x$, $\boldsymbol{\beta}_1$ and $\boldsymbol{\Psi}$? The answer is an expression in $p$ and $q$. Is this the same as your last answer? If so, it means that at the first stage, if the parameters are identifiable from $\boldsymbol{\Phi}$, they are \emph{just identifiable} from $\boldsymbol{\Phi}$.
\item In Stage One (the latent variable model), show the details of how the parameter matrices $\boldsymbol{\Phi}_x$, $\boldsymbol{\beta}_1$ and $\boldsymbol{\Psi}$ can be recovered from $\boldsymbol{\Phi}$.
\item In Stage Two (the measurement model), the parameters are in the matrices $\boldsymbol{\Phi}$, $\boldsymbol{\Omega}_1$ and $\boldsymbol{\Omega}_2$. How many unique parameters are there? The answer is an expression in $p$ and $q$.
\item How many unique variances and covariances are there in $\boldsymbol{\Sigma}$? The answer is an expression in $p$ and $q$.
\item \label{nconstr} How many equality constraints are imposed on $\boldsymbol{\Sigma}$ by the model? The answer is an expression in $p$ and $q$.
\item Show that the number of parameters plus the number of constraints is equal to the number of unique variances and covariances in $\boldsymbol{\Sigma}$. This is a brief calculation using your earlier answers.
\end{enumerate}
% \newpage
\item Here is a one-stage formulation of the double measurement regression model. % See the text for some discussion.
Independently for $i=1, \ldots, n$, let
\begin{eqnarray*}
\mathbf{W}_{i,1} & = & \mathbf{X}_i + \mathbf{e}_{i,1} \\
\mathbf{V}_{i,1} & = & \mathbf{Y}_i + \mathbf{e}_{i,2} \\
\mathbf{W}_{i,2} & = & \mathbf{X}_i + \mathbf{e}_{i,3} \\
\mathbf{V}_{i,2} & = & \mathbf{Y}_i + \mathbf{e}_{i,4} \\
\mathbf{Y}_i & = & \boldsymbol{\beta} \mathbf{X}_i + \boldsymbol{\epsilon}_i
\end{eqnarray*}
where
\begin{itemize}
\item[] $\mathbf{Y}_i$ is a $q \times 1$ random vector of latent response variables. Because $q$ can be greater than one, the regression is multivariate.
\item[] $\boldsymbol{\beta}$ is a $q \times p$ matrix of unknown constants. These are the regression coefficients, with one row for each response variable and one column for each explanatory variable.
\item[] $\mathbf{X}_i$ is a $p \times 1$ random vector of latent explanatory variables, with expected value zero and variance-covariance matrix $\boldsymbol{\Phi}$, a $p \times p$ symmetric and positive definite matrix of unknown constants.
\item[] $\boldsymbol{\epsilon}_i$ is the error term of the latent regression. It is a $q \times 1$ random vector with expected value zero and variance-covariance matrix $\boldsymbol{\Psi}$, a $q \times q$ symmetric and positive definite matrix of unknown constants.
\item[] $\mathbf{W}_{i,1}$ and $\mathbf{W}_{i,2}$ are $p \times 1$ observable random vectors, each representing $\mathbf{X}_i$ plus random error.
\item[] $\mathbf{V}_{i,1}$ and $\mathbf{V}_{i,2}$ are $q \times 1$ observable random vectors, each representing $\mathbf{Y}_i$ plus random error.
\item[] $\mathbf{e}_{i,1}, \ldots, \mathbf{e}_{i,4}$ are the measurement errors in $\mathbf{W}_{i,1}, \mathbf{V}_{i,1}, \mathbf{W}_{i,2}$ and $\mathbf{V}_{i,2}$ respectively. Joining the vectors of measurement errors into a single long vector $\mathbf{e}_i$, we may write its covariance matrix as a partitioned matrix
\begin{equation*}
V(\mathbf{e}_i) = V\left(\begin{array}{c} \mathbf{e}_{i,1} \\ \mathbf{e}_{i,2} \\ \mathbf{e}_{i,3} \\ \mathbf{e}_{i,4} \end{array}\right) =
\left( \begin{array}{c|c|c|c}
\boldsymbol{\Omega}_{11} & \boldsymbol{\Omega}_{12} & \mathbf{0} & \mathbf{0} \\ \hline
\boldsymbol{\Omega}_{12}^\prime & \boldsymbol{\Omega}_{22} & \mathbf{0} & \mathbf{0} \\ \hline
\mathbf{0} & \mathbf{0} & \boldsymbol{\Omega}_{33} & \boldsymbol{\Omega}_{34} \\ \hline
\mathbf{0} & \mathbf{0} & \boldsymbol{\Omega}_{34}^\prime & \boldsymbol{\Omega}_{44}
\end{array} \right) = \boldsymbol{\Omega}.
\end{equation*}
\item[] In addition, the matrices of covariances between $\mathbf{X}_i, \boldsymbol{\epsilon}_i$ and $\mathbf{e}_i$ are all zero.
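\item[] As a sketch of the kind of calculation these assumptions permit (all symbols are as defined above, and $C(\cdot,\cdot)$ denotes a matrix of covariances between two random vectors), note for example that
\begin{displaymath}
C(\mathbf{W}_{i,1}, \mathbf{V}_{i,1}) = C(\mathbf{X}_i + \mathbf{e}_{i,1}, \; \boldsymbol{\beta}\mathbf{X}_i + \boldsymbol{\epsilon}_i + \mathbf{e}_{i,2}) = \boldsymbol{\Phi}\boldsymbol{\beta}^\prime + \boldsymbol{\Omega}_{12},
\end{displaymath}
because $\mathbf{X}_i$, $\boldsymbol{\epsilon}_i$ and $\mathbf{e}_i$ have zero covariance with one another. The other blocks of the partitioned covariance matrix below can be worked out in the same way.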
\end{itemize}
Collecting $\mathbf{W}_{i,1}$, $\mathbf{W}_{i,2}$, $\mathbf{V}_{i,1}$ and $\mathbf{V}_{i,2}$ into a single long data vector $\mathbf{D}_i$, we write its variance-covariance matrix as a partitioned matrix:
\begin{displaymath}
\boldsymbol{\Sigma} =
\left( \begin{array}{c|c|c|c}
\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} & \boldsymbol{\Sigma}_{13} & \boldsymbol{\Sigma}_{14} \\ \hline
 & \boldsymbol{\Sigma}_{22} & \boldsymbol{\Sigma}_{23} & \boldsymbol{\Sigma}_{24} \\ \hline
 & & \boldsymbol{\Sigma}_{33} & \boldsymbol{\Sigma}_{34} \\ \hline
 & & & \boldsymbol{\Sigma}_{44}
\end{array} \right),
\end{displaymath}
where the covariance matrix of $\mathbf{W}_{i,1}$ is $\boldsymbol{\Sigma}_{11}$, the covariance matrix of $\mathbf{V}_{i,1}$ is $\boldsymbol{\Sigma}_{22}$, the matrix of covariances between $\mathbf{W}_{i,1}$ and $\mathbf{V}_{i,1}$ is $\boldsymbol{\Sigma}_{12}$, and so on.
\begin{enumerate}
\item Write the elements of the partitioned matrix $\boldsymbol{\Sigma}$ in terms of the parameter matrices of the model. Be able to show your work for each one.
\item Prove that all the model parameters are identifiable by solving the covariance structure equations.
\item Give a Method of Moments estimator of $\boldsymbol{\Phi}$. There is more than one reasonable answer. Remember, your estimator cannot be a function of any unknown parameters, or you get a zero. For a particular sample, will your estimate be in the parameter space? Mine is.
\item Give a Method of Moments estimator of $\boldsymbol{\beta}$. Remember, your estimator cannot be a function of any unknown parameters, or you get a zero. How do you know your estimator is consistent? You may use $\widehat{\boldsymbol{\Sigma}} \stackrel{a.s.}{\rightarrow} \boldsymbol{\Sigma}$ without proof.
\end{enumerate}
% that is \emph{not} the MLE added after the question was assigned in 2013. But in 2015 I specified MOM instead.
\item Question \ref{SASpig} (the SAS part of this assignment) will use the \emph{Pig Birth Data}. As part of a much larger study, farmers filled out questionnaires about various aspects of their farms. Some questions were asked twice, on two different questionnaires several months apart. Buried in all the questions were
\begin{itemize}
\item Number of breeding sows (female pigs) at the farm on June 1st
\item Number of sows giving birth later that summer
\end{itemize}
There are two readings of these variables, one from each questionnaire. We will assume (maybe incorrectly) that because the questions were buried in a lot of other material and were asked months apart, errors of measurement are independent between the two questionnaires. However, errors of measurement might be correlated within a questionnaire.
\begin{enumerate}
\item Write down a reasonable model for these data, using the usual notation. Give all the details. You may assume normality if you wish.
\item Of course it is hopeless to identify the expected values and intercepts, so we will concentrate on the covariance matrix. Calculate the covariance matrix of one observable data vector $\mathbf{D}_i$.
\item Even though you have a general result that applies to this case, prove that all the parameters in the covariance matrix are identifiable.
\item If there are any equality constraints on the covariance matrix, say what they are.
\item Based on your answer to the last question, how many degrees of freedom should there be in the chisquare test for model fit? Does this agree with your answer to Question~\ref{nconstr}?
\item \label{mombetahat} Give a consistent estimator of $\beta$ that is \emph{not} the MLE, and explain why it's consistent. You may use the consistency of sample variances and covariances without proof. Your estimator \emph{must not} be a function of any unknown parameters, or you get a zero on this part.
\end{enumerate}
\pagebreak
\item \label{SASpig} The Pig Birth Data are given in the file \href{http://www.utstat.toronto.edu/~brunner/data/legal/openpigs.data.txt} {\texttt{openpigs.data.txt}}. There is a link on the course web page in case the one in this document does not work. Note there are $n=114$ farms, so please verify that you are reading the correct number of cases. Use the \texttt{firstobs} option in your \texttt{infile} statement.
\begin{enumerate}
\item Start by reading the data and then running \texttt{proc~corr} to produce a correlation matrix (with tests) of all the observable variables.
\item Use \texttt{proc calis} to fit your model. Please use the \texttt{pshort nostand vardef=n pcorr} options. If you experience numerical problems, you are doing something differently from the way I did it. When I fit a good model everything was fine. When I fit a poor model there was trouble.
\item Does your model fit the data adequately? Answer Yes or No and give three numbers: a chisquare statistic, the degrees of freedom, and a $p$-value.
% G^2 = 0.0871, df = 1, p = 0.7680
\item For each breeding sow present on June 1st, what is the predicted number giving birth that summer? Your answer is a single number from the list file. It is not an integer.
% betahat = 0.7567
\item Using your answer to Question~\ref{mombetahat}, the list file and a calculator, give a \emph{numerical} version of your consistent estimate of $\beta$. How does it compare to the MLE?
% 0.5*(272.67101+260.02857)/348.52989 = 0.7642093 v.s. MLE of 0.7567. Pretty good!
\item Recall that reliability of a measurement is the proportion of its variance that does \emph{not} come from measurement error. What is the estimated reliability of number of breeding sows from questionnaire two? The answer is a number, which you get with a calculator and the output file.
% 1 - 93.82358/(0.7567^2*360.30522+33.93153 +93.82358) = 0.7191449 from proc calis MLEs
\item Is there evidence of correlated measurement error within questionnaires? Answer Yes or No and give some numbers from the list file to support your conclusion.
\end{enumerate}
\end{enumerate}

\vspace{10mm}

\noindent Bring your log file and your list file to the quiz. You may be asked for numbers from your printouts, and you may be asked to hand them in. \textbf{There must be no error messages, and no notes or warnings about invalid data on your log file.}

\end{document}

% \item The answer to that last question was based on two separate tests. Though it is already pretty convincing, conduct a \emph{single} Wald (not likelihood ratio) test of the two null hypotheses simultaneously. The SAS program \texttt{bmi3.sas} has an example of how to do a Wald test.
% \begin{enumerate}
% \item Give the Wald chi-squared statistic, the degrees of freedom and the $p$-value. What do you conclude? Is there evidence of correlated measurement error, or not? % W =45.41656 , df=2, p < 0.0001
% \item Find two examples of $Z^2 \sim \chi^2(1)$ from the output for this question.
% \end{enumerate}
% \item The double measurement design allows the measurement error covariance matrices $\boldsymbol{\Omega}_1$ and $\boldsymbol{\Omega}_2$ to be unequal.
% Carry out a Wald test to see whether the two covariance matrices are equal or not.
% \begin{enumerate}
% \item Give the Wald chi-squared statistic, the degrees of freedom and the $p$-value. What do you conclude? Is there evidence that the two measurement error covariance matrices are unequal? % W = 41.69941 , df=3, p < 0.0001
% \item There is evidence that one of the measurements is less accurate on one questionnaire than the other. Which one is it? Give the Wald chi-squared statistic, the degrees of freedom and the $p$-value. % Number of pigs born is less accurate on questionnaire 2 (estimated variance = 93.82358 compared to 324.09723): W = 9.58393 , df=1, p = 0.0020
% \end{enumerate}