% 431Assignment7.tex Regression with measurement error, some identifiability
\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}

\begin{center}
{\Large \textbf{STA 431s17 Assignment Seven}}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/431s17}{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/431s17}}}
\vspace{1 mm}
\end{center}

\noindent This assignment is about the double measurement design. See lecture Slide Sets 11 and 12, and Section 0.11 (pages 51--61) in Chapter Zero. The non-computer questions on this assignment are for practice, and will not be handed in. For the SAS part of this assignment (Question~\ref{SASpig}), please bring hard copies of your log file and your results file to the quiz. There may be one or more questions about them, and you may be asked to hand the printouts in with the quiz.

\begin{enumerate}

% This first one is pretty good, though.
\item The point of this question is that when the parameters of a model are identifiable, the number of covariance structure equations minus the number of parameters equals the number of model-induced equality constraints on $\boldsymbol{\Sigma}$. It is these equality constraints that are being tested by the chi-squared test for goodness of fit. In the lecture notes, look at the matrix formulation and discussion of double measurement regression starting on Slide 25. The latent vector $\mathbf{X}_i$ is $p \times 1$, and the latent vector $\mathbf{Y}_i$ is $q \times 1$. As usual, expected values and intercepts are not identifiable, so confine your attention to $\boldsymbol{\Sigma} = [\sigma_{ij}]$, the covariance matrix of the observable data. A small numerical check for the scalar case $p=q=1$ appears after part~(d) below.
% In this question, watch out for Stage 1 and Stage 2. I intended to switch the order (2013). But I didn't: 2015, at least not in lecture. It's okay in 2017.
\begin{enumerate}
\item Here's something that will help with the calculations in this problem. If a covariance matrix is $n \times n$,
\begin{enumerate}
\item How many unique covariances are there? Factor and simplify.
\item How many unique variances and covariances are there in total? Factor and simplify.
\end{enumerate}
\item What are the dimensions of $\boldsymbol{\Sigma}$? Give the number of rows and the number of columns. It's an expression in $p$ and $q$.
\item \label{howmanysigmas} How many unique variances and covariances ($\sigma_{ij}$ quantities) are there in $\boldsymbol{\Sigma}$ when there are no model-induced constraints? The answer is an expression in $p$ and $q$.
\item List the parameter matrices that appear in $\boldsymbol{\Sigma}$.
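
Here is the check for the scalar case $p=q=1$ promised above. It is only for verifying your general answers (nothing to hand in), and it uses the notation of this question. With $p=q=1$, the observable data vector has $2(p+q)=4$ elements, so the number of unique variances and covariances in $\boldsymbol{\Sigma}$ is
\begin{displaymath}
\frac{4(4+1)}{2} = 10.
\end{displaymath}
The unknown parameters are $\phi_x$, $\beta_1$ and $\psi$, plus the $3+3=6$ unique elements of the two measurement error covariance matrices $\boldsymbol{\Omega}_1$ and $\boldsymbol{\Omega}_2$, for a total of $9$. So if the parameters are identifiable, the model must impose $10-9=1$ equality constraint on $\boldsymbol{\Sigma}$. Your general expressions in $p$ and $q$ should reduce to these numbers.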
\item Denoting $cov(\mathbf{F}_i)$ by $\boldsymbol{\Phi} = [\phi_{ij}]$, where $\mathbf{F}_i$ is the combined vector of latent variables as in the lecture slides, how many unique variances and covariances ($\phi_{ij}$ quantities) are there in $\boldsymbol{\Phi}$ if there are no model-induced equality constraints? The answer is an expression in $p$ and $q$.
\item In total, how many unknown parameters are there in the Stage One matrices $\boldsymbol{\Phi}_x$, $\boldsymbol{\beta}_1$ and $\boldsymbol{\Psi}$? The answer is an expression in $p$ and $q$. Is this the same as your last answer? If so, it means that at the first stage, if the parameters are identifiable from $\boldsymbol{\Phi}$, they are \emph{just identifiable} from $\boldsymbol{\Phi}$.
\item Still in Stage One (the latent variable model), show the details of how the parameter matrices $\boldsymbol{\Phi}_x$, $\boldsymbol{\beta}_1$ and $\boldsymbol{\Psi}$ can be recovered from $\boldsymbol{\Phi}$. Start by calculating $\boldsymbol{\Phi}$ as a function of $\boldsymbol{\Phi}_x$, $\boldsymbol{\beta}_1$ and $\boldsymbol{\Psi}$. Once you have done this, you will have shown that the function relating $\boldsymbol{\Phi}$ to $(\boldsymbol{\Phi}_x, \boldsymbol{\beta}_1, \boldsymbol{\Psi})$ is one-to-one.
\item In Stage Two (the measurement model), the parameters are in the matrices $\boldsymbol{\Phi}$, $\boldsymbol{\Omega}_1$ and $\boldsymbol{\Omega}_2$. How many unique parameters are there? The answer is an expression in $p$ and $q$.
\item \label{nconstr} By inspecting the expression for $\boldsymbol{\Sigma}$ on Slide 30, state the number of equality constraints that are imposed on $\boldsymbol{\Sigma}$ by the model. The answer is an expression in $p$ and $q$.
\item Show that the number of parameters plus the number of constraints is equal to the number of unique variances and covariances in $\boldsymbol{\Sigma}$. This is a brief calculation using your answers to~\ref{howmanysigmas} and the last two questions.
\end{enumerate}

% \newpage
\item Here is a one-stage formulation of the double measurement regression model. % See the text for some discussion.
Independently for $i=1, \ldots, n$, let
\begin{eqnarray*}
\mathbf{W}_{i,1} & = & \mathbf{X}_i + \mathbf{e}_{i,1} \\
\mathbf{V}_{i,1} & = & \mathbf{Y}_i + \mathbf{e}_{i,2} \\
\mathbf{W}_{i,2} & = & \mathbf{X}_i + \mathbf{e}_{i,3} \\
\mathbf{V}_{i,2} & = & \mathbf{Y}_i + \mathbf{e}_{i,4} \\
\mathbf{Y}_i & = & \boldsymbol{\beta} \mathbf{X}_i + \boldsymbol{\epsilon}_i
\end{eqnarray*}
where
\begin{itemize}
\item[] $\mathbf{Y}_i$ is a $q \times 1$ random vector of latent response variables. Because $q$ can be greater than one, the regression is multivariate.
\item[] $\boldsymbol{\beta}$ is a $q \times p$ matrix of unknown constants. These are the regression coefficients, with one row for each response variable and one column for each explanatory variable.
\item[] $\mathbf{X}_i$ is a $p \times 1$ random vector of latent explanatory variables, with expected value zero and variance-covariance matrix $\boldsymbol{\Phi}_x$, a $p \times p$ symmetric and positive definite matrix of unknown constants.
\item[] $\boldsymbol{\epsilon}_i$ is the error term of the latent regression. It is a $q \times 1$ random vector with expected value zero and variance-covariance matrix $\boldsymbol{\Psi}$, a $q \times q$ symmetric and positive definite matrix of unknown constants.
\item[] $\mathbf{W}_{i,1}$ and $\mathbf{W}_{i,2}$ are $p \times 1$ observable random vectors, each representing $\mathbf{X}_i$ plus random error.
\item[] $\mathbf{V}_{i,1}$ and $\mathbf{V}_{i,2}$ are $q \times 1$ observable random vectors, each representing $\mathbf{Y}_i$ plus random error.
\item[] $\mathbf{e}_{i,1}, \ldots, \mathbf{e}_{i,4}$ are the measurement errors in $\mathbf{W}_{i,1}, \mathbf{V}_{i,1}, \mathbf{W}_{i,2}$ and $\mathbf{V}_{i,2}$ respectively. Joining the vectors of measurement errors into a single long vector $\mathbf{e}_i$, its covariance matrix may be written as a partitioned matrix
\begin{equation*}
cov(\mathbf{e}_i) = cov\left(\begin{array}{c}
\mathbf{e}_{i,1} \\ \mathbf{e}_{i,2} \\ \mathbf{e}_{i,3} \\ \mathbf{e}_{i,4}
\end{array}\right)
= \left( \begin{array}{c|c|c|c}
\boldsymbol{\Omega}_{11} & \boldsymbol{\Omega}_{12} & \mathbf{0} & \mathbf{0} \\ \hline
\boldsymbol{\Omega}_{12}^\top & \boldsymbol{\Omega}_{22} & \mathbf{0} & \mathbf{0} \\ \hline
\mathbf{0} & \mathbf{0} & \boldsymbol{\Omega}_{33} & \boldsymbol{\Omega}_{34} \\ \hline
\mathbf{0} & \mathbf{0} & \boldsymbol{\Omega}_{34}^\top & \boldsymbol{\Omega}_{44}
\end{array} \right) = \boldsymbol{\Omega}.
\end{equation*}
\item[] In addition, the matrices of covariances between $\mathbf{X}_i$, $\boldsymbol{\epsilon}_i$ and $\mathbf{e}_i$ are all zero.
\end{itemize}
Collecting $\mathbf{W}_{i,1}$, $\mathbf{V}_{i,1}$, $\mathbf{W}_{i,2}$ and $\mathbf{V}_{i,2}$ into a single long data vector $\mathbf{D}_i$, we write its variance-covariance matrix as a partitioned matrix:
\begin{displaymath}
\boldsymbol{\Sigma} = \left( \begin{array}{c|c|c|c}
\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} & \boldsymbol{\Sigma}_{13} & \boldsymbol{\Sigma}_{14} \\ \hline
 & \boldsymbol{\Sigma}_{22} & \boldsymbol{\Sigma}_{23} & \boldsymbol{\Sigma}_{24} \\ \hline
 &  & \boldsymbol{\Sigma}_{33} & \boldsymbol{\Sigma}_{34} \\ \hline
 &  &  & \boldsymbol{\Sigma}_{44}
\end{array} \right),
\end{displaymath}
where the covariance matrix of $\mathbf{W}_{i,1}$ is $\boldsymbol{\Sigma}_{11}$, the covariance matrix of $\mathbf{V}_{i,1}$ is $\boldsymbol{\Sigma}_{22}$, the matrix of covariances between $\mathbf{W}_{i,1}$ and $\mathbf{V}_{i,1}$ is $\boldsymbol{\Sigma}_{12}$, and so on.
\begin{enumerate}
\item Write the elements of the partitioned matrix $\boldsymbol{\Sigma}$ in terms of the parameter matrices of the model. Be able to show your work for each one.
\item Prove that all the model parameters are identifiable by solving the covariance structure equations.
\item Give a Method of Moments estimator of $\boldsymbol{\Phi}_x$. There is more than one reasonable answer. Remember, your estimator cannot be a function of any unknown parameters, or you get a zero. For a particular sample, will your estimate be in the parameter space? Mine is.
\item Give a Method of Moments estimator for $\boldsymbol{\beta}$. Remember, your estimator cannot be a function of any unknown parameters, or you get a zero. How do you know your estimator is consistent? Use $\widehat{\boldsymbol{\Sigma}} \stackrel{p}{\rightarrow} \boldsymbol{\Sigma}$.
\end{enumerate}
% that is \emph{not} the MLE added after the question was assigned in 2013. But in 2015 I specified MOM instead.

\item Question \ref{SASpig} (the SAS part of this assignment) will use the \emph{Pig Birth Data}. As part of a much larger study, farmers filled out questionnaires about various aspects of their farms. Some questions were asked twice, on two different questionnaires several months apart. Buried in all the questions were
\begin{itemize}
\item Number of breeding sows (female pigs) at the farm on June 1st
\item Number of sows giving birth later that summer.
\end{itemize}
There are two readings of these variables, one from each questionnaire. We will assume (maybe incorrectly) that, because the questions were buried in a lot of other material and were asked months apart, the errors of measurement are independent between the two questionnaires. However, errors of measurement might be correlated within a questionnaire.
\begin{enumerate}
\item Propose a reasonable model for these data, using the usual notation. Give all the details. You may assume normality if you wish.
\item Make a path diagram of the model you have proposed.
\item Write the model equations again, this time in centered form. The little $c$ symbols above the variables can be invisible.
\item Of course it is hopeless to identify the expected values and intercepts, so we will concentrate on the covariance matrix. Calculate the covariance matrix of one observable data vector $\mathbf{D}_i$.
\item Even though you have a general result that applies to this case, prove that all the parameters in the covariance matrix are identifiable.
\item If there are any equality constraints on the covariance matrix, say what they are.
\item Based on your answer to the last question, how many degrees of freedom should there be in the chi-squared test for model fit? Does this agree with your answer to Question~\ref{nconstr}?
\item \label{mombetahat} Give a consistent estimator of $\beta$ that is \emph{not} the MLE, and explain why it's consistent. You may use the consistency of sample variances and covariances without proof. Your estimator \emph{must not} be a function of any unknown parameters, or you get a zero on this part.
\end{enumerate}

% \pagebreak
\item \label{SASpig} The Pig Birth Data are given in the file \href{http://www.utstat.toronto.edu/~brunner/data/legal/openpigs.data.txt}{\texttt{openpigs.data.txt}}. Use the \texttt{firstobs} option in your \texttt{infile} statement to skip the first few lines. This is preferable to stripping the data file of documentation. There are $n=114$ farms; please verify that you are reading the correct number of cases.
\begin{enumerate}
\item Start by reading the data and then running \texttt{proc~corr} to produce a correlation matrix (with tests) of all the observable variables. A minimal sketch of this step appears after part~(e) below.
\item Use \texttt{proc calis} to fit your model. Please use the \texttt{pshort nostand vardef=n pcorr} options. If you experience numerical problems, you are doing something differently from the way I did it. When I fit a good model, everything was fine. When I fit a poor model, there was trouble. Just to verify that we are fitting the same model, my value of the Akaike Information Criterion (which we're not using) is 18.0871.
\item Does your model fit the data adequately? Answer Yes or No and give three numbers: a chi-squared statistic, the degrees of freedom, and a $p$-value.
% G^2 = 0.0871, df = 1, p = 0.7680
\item \label{betahat} For each breeding sow present on June 1st, what is the predicted number giving birth that summer? Your answer is a single number from the results file. It is not an integer.
% betahat = 0.7567
\item Using your answer to Question~\ref{mombetahat}, the results file and a calculator, give a \emph{numerical} version of your consistent estimate of $\beta$. How does it compare to the MLE?
% 0.5*(272.67101+260.02857)/348.52989 = 0.7642093 v.s. MLE of 0.7567. Pretty good!
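
In case it helps you get started on part~(a), here is a minimal sketch of the kind of \texttt{data} step and \texttt{proc corr} run involved. The variable names and the \texttt{firstobs} value below are placeholders rather than facts about the data file; look at the documentation lines at the top of \texttt{openpigs.data.txt} and adjust both before using anything like this.
\begin{verbatim}
/* Sketch only: variable names and the firstobs value are placeholders */
data pigs;
     infile 'openpigs.data.txt' firstobs=9; /* Set firstobs to skip the
                                               documentation lines      */
     input sows1 births1 sows2 births2;     /* Use the real variable
                                               names, in the real order */
run;

proc corr data=pigs;   /* Default output gives correlations with tests */
     var sows1 births1 sows2 births2;
run;
\end{verbatim}
Check your log to verify that all $n=114$ cases were read.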
\item Since maximum likelihood estimates are asymptotically normal (approximately normal for large samples), a large-sample confidence interval is $\widehat{\theta} \pm 1.96 \, se$, where $se$ is the standard error (estimated standard deviation) of $\widehat{\theta}$. Give a large-sample confidence interval for your answer to~\ref{betahat}.
\item Recall that the reliability of a measurement is the proportion of its variance that does \emph{not} come from measurement error. What is the estimated reliability of the number of breeding sows from questionnaire two? The answer is a number, which you get with a calculator and the output file.
% 1 - 93.82358/(0.7567^2*360.30522+33.93153 +93.82358) = 0.7191449 from proc calis MLEs
\item Is there evidence of correlated measurement error within questionnaires? Answer Yes or No and give some numbers from the results file to support your conclusion.
\item The answer to that last question was based on two separate tests. Though it is already pretty convincing, conduct a \emph{single} Wald (not likelihood ratio) test of the two null hypotheses simultaneously. The SAS program \texttt{bmi3.sas} has an example of how to do a Wald test.
\begin{enumerate}
\item Give the Wald chi-squared statistic, the degrees of freedom and the $p$-value. What do you conclude? Is there evidence of correlated measurement error, or not?
% W =45.41656 , df=2, p < 0.0001
\item Find two examples of $Z^2 \sim \chi^2(1)$ in the output for this question. Locate the tests and verify that the one-sided $p$-value from the $\chi^2$ test equals the two-sided $p$-value from the $Z$ test.
\end{enumerate}
\item The double measurement design allows the measurement error covariance matrices $\boldsymbol{\Omega}_1$ and $\boldsymbol{\Omega}_2$ to be unequal. Carry out a Wald test to see whether the two covariance matrices are equal or not.
\begin{enumerate}
\item Give the Wald chi-squared statistic, the degrees of freedom and the $p$-value. What do you conclude? Is there evidence that the two measurement error covariance matrices are unequal?
% W = 41.69941 , df=3, p < 0.0001
\item There is evidence that one of the measurements is less accurate on one questionnaire than on the other. Which one is it? Give the Wald chi-squared statistic, the degrees of freedom and the $p$-value.
% Number of pigs born is less accurate on questionnaire 2 (estimated variance = 93.82358 compared to 324.09723): W = 9.58393 , df=1, p = 0.0020
\end{enumerate}
\end{enumerate}
\end{enumerate}

\vspace{10mm}

\noindent Bring your log file and your results file to the quiz. You may be asked for numbers from your printouts, and you may be asked to hand them in. \textbf{There must be no error messages, and no notes or warnings about invalid data in your log file.}

\end{document}

%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%