\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 431s17 Assignment Ten}}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/431s17}{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/431s17}}}
\vspace{1 mm}
\end{center}

\noindent The non-computer questions on this assignment are practice for the quiz, and will not be handed in. Please bring your log files and your results files for the SAS part of this assignment (Question~\ref{SASpoverty}) to the quiz. There may be one or more questions about them, and you may be asked to hand printouts in with the quiz.

\begin{enumerate}
\item The following model is centered, and has zero covariance between all pairs of exogenous variables including error terms.
\begin{eqnarray}
Y_1 &=& \gamma_1 X_1 +\gamma_2 X_2 + \epsilon_1 \nonumber \\
Y_2 &=& \beta Y_1 + \gamma_3 X_1 + \epsilon_2 \nonumber \\
W_1 &=& \lambda_1 X_1 + e_1 \nonumber \\
W_2 &=& \lambda_2 X_2 + e_2 \nonumber \\
V_1 &=& \lambda_3 Y_1 + e_3 \nonumber \\
V_2 &=& \lambda_4 Y_2 + e_4 \nonumber
\end{eqnarray}
\begin{enumerate}
\item Make a path diagram.
\item Referring to the general two-stage structural equation model on the formula sheet, write the model equations in matrix form. This means putting the symbols from the model above into the matrices. Also give the matrices $\boldsymbol{\Phi}_x$, $\boldsymbol{\Psi}$ and $\boldsymbol{\Omega}$. The dimensions must be right for the specific model above.
\end{enumerate}

\item Consider the general factor analysis model
\begin{displaymath}
\mathbf{D}_i = \boldsymbol{\Lambda} \mathbf{F}_i + \mathbf{e}_i,
\end{displaymath}
where $\boldsymbol{\Lambda}$ is a $k\times p$ matrix of factor loadings, the vector of factors $\mathbf{F}_i$ is $p\times 1$ multivariate normal with expected value zero and covariance matrix $\boldsymbol{\Phi}$, and $\mathbf{e}_i$ is multivariate normal and independent of $\mathbf{F}_i$, with expected value zero and covariance matrix $\boldsymbol{\Omega}$. All covariance matrices are positive definite.
\begin{enumerate}
\item How do you know that $\mathbf{D}_i$ is multivariate normal?
\item Calculate the matrix of covariances between the observable variables $\mathbf{D}_i$ and the underlying factors $\mathbf{F}_i$.
\item Give the covariance matrix of $\mathbf{D}_i$. Show your work.
\item Because $\boldsymbol{\Phi}$ is symmetric and positive definite, it has a square root matrix that is also symmetric. Using this, show that the parameters of the general factor analysis model are not identifiable.
\item In an attempt to obtain a model whose parameters can be successfully estimated, let $\boldsymbol{\Omega}$ be diagonal (errors are uncorrelated) and set $\boldsymbol{\Phi}$ to the identity matrix (standardizing the factors). Show that the parameters of this revised model are still not identifiable. Hint: An orthogonal matrix $\mathbf{R}$ (corresponding to a rotation) is one satisfying $\mathbf{RR}^\top=\mathbf{I}$.
\end{enumerate}
% There is a completely standardized version of this and following questions in 2015 A9.
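If you want a numerical sanity check of the two non-identifiability arguments above before writing the proofs, the following short script may help. It is illustration only, not part of the assignment; all matrices in it are made up, and the language (Python with numpy) is not the software used in this course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up parameter values: 4 observed variables, 2 factors
Lam = rng.normal(size=(4, 2))                 # factor loadings Lambda
A = rng.normal(size=(2, 2))
Phi = A @ A.T + 2.0 * np.eye(2)               # positive definite Phi
Omega = np.diag(rng.uniform(0.5, 2.0, size=4))  # diagonal Omega

Sigma = Lam @ Phi @ Lam.T + Omega             # Sigma = Lambda Phi Lambda' + Omega

# (d) Symmetric square root of Phi via its spectral decomposition
w, V = np.linalg.eigh(Phi)
Phi_half = V @ np.diag(np.sqrt(w)) @ V.T

# Absorb Phi into the loadings: Lambda2 = Lambda Phi^{1/2}, with Phi replaced by I
Lam2 = Lam @ Phi_half
Sigma2 = Lam2 @ np.eye(2) @ Lam2.T + Omega
print(np.allclose(Sigma, Sigma2))             # same Sigma, different parameters

# (e) Even with Phi = I, rotating the loadings changes nothing
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(R @ R.T, np.eye(2)))        # R is orthogonal
Lam3 = Lam2 @ R
Sigma3 = Lam3 @ Lam3.T + Omega
print(np.allclose(Sigma, Sigma3))             # still the same Sigma
```

Two different parameter sets producing the same $\boldsymbol{\Sigma}$ is exactly what non-identifiability means, since the distribution of the data depends on the parameters only through $\boldsymbol{\Sigma}$.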
\item Let \begin{eqnarray*} D_1 & = & \lambda_1 F_1 + e_1 \\ D_2 & = & \lambda_2 F_2 + e_2 \\ D_3 & = & \lambda_3 F_3 + e_3, \end{eqnarray*} where $F_1$, $F_2$, $F_3$, $e_1$, $e_2$ and $e_3$ are all independent with $F_j \sim N(0,1)$ and $e_j \sim N(0,\omega_j)$. All the expected values are zero. You can tell from the notation which variables are observable. \begin{enumerate} \item Give the variance-covariance matrix of the observable variables. \item Are the model parameters identifiable? Answer Yes or No and prove your answer. \item Even though the parameters are not identifiable, the model itself is testable. That is, it implies a set of equality restrictions on the covariance matrix $\boldsymbol{\Sigma}$ that could be tested, and rejecting the null hypothesis would call the model into question. State the null hypothesis. Again, it is a statement about the $\sigma_{i,j}$ values. \end{enumerate} \item \label{f1v3} Here is another factor analysis model. This one has a single underlying factor. \begin{eqnarray*} D_1 & = & \lambda_1 F + e_1 \\ D_2 & = & \lambda_2 F + e_2 \\ D_3 & = & \lambda_3 F + e_3, \end{eqnarray*} where the factor and error terms are all independent, $F \sim N(0,1)$, $e_j \sim N(0,\omega_j)$, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are nonzero constants with $\lambda_1>0$. \begin{enumerate} \item Give the variance-covariance matrix of the observed variables. \item Are the model parameters identifiable? Answer Yes or No and prove your answer. \end{enumerate} \item \label{f1v4} Suppose we added another variable to the model of Question~\ref{f1v3}. That is, we add \begin{displaymath} D_4 = \lambda_4 F + e_4, \end{displaymath} with assumptions similar to the ones of Question~\ref{f1v3}. Now suppose that $\lambda_2=0$, while the other factor loadings are non-zero. \begin{enumerate} \item Is $\lambda_2$ identifiable? Justify your answer. \item Are the other factor loadings identifiable? Justify your answer. 
\end{enumerate}
\item \label{f1v5} Suppose we added a fifth variable to the model of Question~\ref{f1v4}. That is, we add
\begin{displaymath}
D_5 = \lambda_5 F + e_5,
\end{displaymath}
with assumptions similar to the ones of Question~\ref{f1v3}. Now suppose that $\lambda_3=\lambda_4=0$, while the other factor loadings are non-zero.
\begin{enumerate}
\item Are $\lambda_3$ and $\lambda_4$ identifiable? Justify your answer.
\item Are the other three factor loadings identifiable? Justify your answer.
\item State the general pattern that is emerging here.
\end{enumerate}
\item \label{f2v6} We now extend the model of Question~\ref{f1v3} by adding a second factor. Let
\begin{eqnarray*}
D_1 & = & \lambda_1 F_1 + e_1 \\
D_2 & = & \lambda_2 F_1 + e_2 \\
D_3 & = & \lambda_3 F_1 + e_3 \\
D_4 & = & \lambda_4 F_2 + e_4 \\
D_5 & = & \lambda_5 F_2 + e_5 \\
D_6 & = & \lambda_6 F_2 + e_6,
\end{eqnarray*}
where all expected values are zero, $Var(e_i)=\omega_i$ for $i=1, \ldots, 6$, $Var(F_1)=Var(F_2)=1$, $Cov(F_1,F_2) = \phi_{12}$, the factors are independent of the error terms, and all the error terms are independent of each other. All the factor loadings are non-zero.
\begin{enumerate}
\item Give the covariance matrix of the observable variables. Show the necessary work. A lot of the work has already been done.
\item \label{iDenT} Are the model parameters identifiable? Answer Yes or No and prove your answer.
\item Write the model in matrix form as $\mathbf{D} = \boldsymbol{\Lambda} \mathbf{F} + \mathbf{e}$. That is, give the matrices. For example, $\mathbf{D}$ is $6 \times 1$.
\item Recall that a \emph{rotation} matrix is any square matrix $\mathbf{R}$ satisfying $\mathbf{RR}^\top = \mathbf{I}$. Give a specific $2 \times 2$ rotation matrix $\mathbf{R}$ so that $\boldsymbol{\Lambda}$ and $\boldsymbol{\Lambda}_2 = \boldsymbol{\Lambda}\mathbf{R}$ yield the same $\boldsymbol{\Sigma} = cov(\mathbf{D})$. Hint: Use your answer to Question~\ref{iDenT}.
\item Suppose we add the conditions $\lambda_1>0$ and $\lambda_4>0$. Are the parameters identifiable now? \end{enumerate} \item \label{f2v5} In Question~\ref{f2v6}, suppose we added just two variables along with the second factor. That is, we omit the equation for $D_6$, while keeping $\lambda_1>0$ and $\lambda_4>0$. Are the model parameters identifiable in this case? Answer Yes or No; show your work. \item \label{f3v9} Let's add a third factor to the model of Question~\ref{f2v6}. That is, we keep the equation for $D_6$ and add \begin{eqnarray*} D_7 & = & \lambda_7 F_3 + e_7 \\ D_8 & = & \lambda_8 F_3 + e_8 \\ D_9 & = & \lambda_9 F_3 + e_9 \end{eqnarray*} with $\lambda_1>0$, $\lambda_4>0$, $\lambda_7>0$ and other assumptions similar to the ones we have been using. Are the model parameters identifiable? You don't have to do any calculations if you see the pattern. \item \label{justone} In this factor analysis model, the observed variables are \emph{not} standardized, and the factor loading for $D_1$ is set equal to one. Let \begin{eqnarray*} D_1 & = & F + e_1 \\ D_2 & = & \lambda_2 F + e_2 \\ D_3 & = & \lambda_3 F + e_3, \end{eqnarray*} where $F \sim N(0,\phi)$, $e_1$, $e_2$ and $e_3$ are normal and independent of $F$ and each other with expected value zero, $Var(e_1)=\omega_1,Var(e_2)=\omega_2,Var(e_3)=\omega_3$, and $\lambda_2$ and $\lambda_3$ are nonzero constants. \begin{enumerate} \item Calculate the variance-covariance matrix of the observed variables. \item Are the model parameters identifiable? Answer Yes or No and prove your answer. \end{enumerate} \item \label{two} We now extend the preceding model by adding another factor. 
Let
\begin{eqnarray*}
D_1 & = & F_1 + e_1 \\
D_2 & = & \lambda_2 F_1 + e_2 \\
D_3 & = & \lambda_3 F_1 + e_3 \\
D_4 & = & F_2 + e_4 \\
D_5 & = & \lambda_5 F_2 + e_5 \\
D_6 & = & \lambda_6 F_2 + e_6,
\end{eqnarray*}
where all expected values are zero, $Var(e_i)=\omega_i$ for $i=1, \ldots, 6$,
\begin{displaymath}
\begin{array}{ccc} % Array of Arrays: Nice display of matrices.
cov\left( \begin{array}{c} F_1 \\ F_2 \end{array} \right) & = &
\left( \begin{array}{c c} \phi_{11} & \phi_{12} \\ \phi_{12} & \phi_{22} \end{array} \right),
\end{array}
\end{displaymath}
and $\lambda_2,\lambda_3, \lambda_5$ and $\lambda_6$ are nonzero constants.
\begin{enumerate}
\item Give the covariance matrix of the observable variables. Show the necessary work. A lot of the work has already been done in Question~\ref{justone}.
\item Are the model parameters identifiable? Answer Yes or No and prove your answer.
\end{enumerate}
\item Let's add a third factor to the model of Question~\ref{two}. That is, we add
\begin{eqnarray*}
D_7 & = & F_3 + e_7 \\
D_8 & = & \lambda_8 F_3 + e_8 \\
D_9 & = & \lambda_9 F_3 + e_9
\end{eqnarray*}
and
\begin{displaymath}
\begin{array}{cccc} % Nice display of matrices.
cov\left( \begin{array}{c} F_1 \\ F_2 \\ F_3 \end{array} \right) & = &
\left( \begin{array}{c c c} \phi_{11} & \phi_{12} & \phi_{13} \\ \phi_{12} & \phi_{22} & \phi_{23} \\ \phi_{13} & \phi_{23} & \phi_{33} \end{array} \right),
\end{array}
\end{displaymath}
with $\lambda_8\neq0$, $\lambda_9\neq0$ and so on. Are the model parameters identifiable? You don't have to do any calculations if you see the pattern.

%\newpage
\newpage
\item \label{SASpoverty} The SAS part of this assignment is based on the Poverty Data. The data are given in the file \href{http://www.utstat.toronto.edu/~brunner/data/illegal/poverty.data.txt}{\texttt{poverty.data.txt}}. There is a link on the course web page in case the one in this document does not work. This data set contains information from a sample of 97 countries.
In order, the variables include Live birth rate per 1,000 of population, Death rate per 1,000 of population, Infant deaths per 1,000 of population under 1 year old, Life expectancy at birth for males, Life expectancy at birth for females, and Gross National Product per capita in U.S. dollars. There is also a categorical variable representing location (continent), and finally the name of the country. This can be a very challenging and frustrating data set to work with, because correlated measurement errors produce negative variance estimates and other numerical problems almost everywhere you turn. To make your job easier, please confine your analyses to the following four variables:
\begin{itemize}
\item[] Life Expectancy: Average of life expectancy for males and life expectancy for females.
\item[] Infant mortality rate.
\item[] Birth rate.
\item[] GNP/1000 = Gross national product in thousands of dollars. The re-scaling is a solution to numerical problems in fitting the model.
\end{itemize}
You are not using all the variables in the data file, but you should read them all, because other ways of skipping variables are more trouble. The names of character-valued variables (the last two) must be followed by dollar signs (\$). Here is a picture of a factor analysis model with 2 factors.
\begin{center}
\includegraphics[width=3in]{PovFactorPic} % Need \usepackage{graphicx}
\end{center}
The reason for making birth rate an indicator of wealth is that birth control costs money.
\begin{enumerate}
\item Fit the model with \texttt{proc calis}. My value of the Schwarz Bayesian Criterion (whatever that is) is 41.6961. Make sure to include the \texttt{pcorr} option, so you will get $\boldsymbol{\Sigma(\widehat{\theta})}$. You will have to re-parameterize. Which of the two standard re-parameterizations should you choose? Suppose we are interested in the correlation between Health and Wealth.
\item What are the unknown parameters for this model?
Give your answer in the form of a list of names from your SAS job.
\item What rule tells you that the re-parameterization you have chosen results in a parameter vector that is identifiable (at least, it's identifiable in most of the parameter space)? Name the rule given in Lecture 20.
\item Does this model fit the data adequately? Answer Yes or No, and back up your answer with two numbers from the printout: the value of a test statistic, and a $p$-value.
% G^2 = 1.0983, p = 0.2946
\item Why does the goodness of fit test have one degree of freedom?
\item What is the maximum likelihood estimate of the correlation between factors? The answer is a single number from the printout. % 0.96065
\item Now fit a model with the other common re-parameterization, again including the \texttt{pcorr} option.
\begin{enumerate}
\item Compare the two likelihood ratio tests for model fit. What do you see?
\item Compare the two $\boldsymbol{\Sigma(\widehat{\theta})}$ matrices. What do you see?
\item Give the maximum likelihood estimate of $\lambda_2/\lambda_1$ based on output from the \emph{first} model. Can you find this number in the output from the second model?
% 44.02500/(-10.25556) = -4.292793, compare -4.29281 from second model.
\item Based on the output from the second model, give the maximum likelihood estimate of the correlation between Health and Wealth. Can you find this number in the output from the first model?
% 53.73277 / sqrt(105.17622*29.74605) = 0.9606514, compare 0.96065 from first model.
\end{enumerate}
\item Finally, the high estimated correlation between factors from the first part of this question suggests that there might be just one underlying factor: wealth. Try a single-factor model and see if it fits. Locate the relevant chi-squared statistic, degrees of freedom and $p$-value. Do the estimated factor loadings make sense? What do you conclude? Do you like the one-factor model or the two-factor model?
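As an aside, the two quantities asked for in the last two sub-questions are one-line computations once you have the estimates. The numbers below are invented for illustration (they are \emph{not} from the poverty data), and the language is Python rather than SAS:

```python
import math

# Made-up estimates, for illustration only
lam1_hat, lam2_hat = 2.0, -5.0        # loadings under the standardized-factor parameterization
phi11_hat, phi22_hat = 4.0, 9.0       # factor variances under the other parameterization
phi12_hat = 5.4                       # factor covariance

# Ratio of loadings; under the lambda_1 = 1 parameterization,
# this ratio appears directly as the second loading.
ratio = lam2_hat / lam1_hat           # -2.5

# Correlation between factors, from the factor covariance matrix
corr = phi12_hat / math.sqrt(phi11_hat * phi22_hat)   # 0.9

print(ratio, corr)
```

The point of the comparison in the question is that such functions of the parameters agree across the two standard re-parameterizations, even though the raw parameter estimates do not.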
% Chi-squared = 2.8232, or 2.7922 without vardef=n from proc calis, compared to 2.7922187 from proc factor "without Bartlett's correction." But I like vardef=n, so my answer is 2.8232.
\end{enumerate}
\end{enumerate}

\vspace{60mm}

\noindent Bring your log file and your results file to the quiz. You may be asked for numbers from your printouts, and you may be asked to hand them in. \textbf{There must be no error messages, and no notes or warnings about invalid data on your log file.} Warnings about missing data are okay. The USSR did not co-operate.

\end{document}

% Wait with this one and possibly make it lecture.
\item This question outlines a different approach to identifying the parameters of a measurement model. Recall that when latent variables are measured with error and the error terms are not correlated, only the variances of $\boldsymbol{\Sigma}$ are affected by measurement error. The covariances are untouched.
\begin{enumerate}
\item Just to remind yourself of this fact, calculate $\boldsymbol{\Sigma}$ for the following model. Independently for $i=1, \ldots, n$, let
% Need eqnarray inside a parbox to make it the cell of a table
\begin{tabular}{ccc}
\parbox[m]{1.5in}{
\begin{eqnarray*}
D_{i,1} &=& F_{i,1} + e_{i,1} \\
D_{i,2} &=& F_{i,2} + e_{i,2} \\ &&
\end{eqnarray*}
} % End parbox
&
$cov\left( \begin{array}{c} F_{i,1} \\ F_{i,2} \end{array} \right) = \left( \begin{array}{c c} \phi_{11} & \phi_{12} \\ \phi_{12} & \phi_{22} \end{array} \right)$
&
$cov\left( \begin{array}{c} e_{i,1} \\ e_{i,2} \end{array} \right) = \left( \begin{array}{c c} \omega_1 & 0 \\ 0 & \omega_2 \end{array} \right)$
\end{tabular}
\item Based on a random sample of $(D_1,D_2)$ pairs, do we have $\widehat{\boldsymbol{\Sigma}} \stackrel{a.s.}{\rightarrow} \boldsymbol{\Phi}$? Answer Yes or No and briefly justify your answer.
\item Denote the reliability of $D_{i,1}$ as a measure of $F_{i,1}$ by $r_1$, and denote the reliability of $D_{i,2}$ as a measure of $F_{i,2}$ by $r_2$.
Suppose you have good (consistent) estimates of $r_1$ and $r_2$ from another source; say $\widehat{r}_1 \stackrel{a.s.}{\rightarrow} r_1$ and $\widehat{r}_2 \stackrel{a.s.}{\rightarrow} r_2$. Give a consistent estimator of $\boldsymbol{\Phi}$. Show your work.
\end{enumerate}
The point of this question is that sometimes you can use ``auxiliary'' (outside) information to identify the parameters of a measurement model and rescue the analysis of data that were not collected with latent variable modelling in mind. This is especially promising in Psychology, where a lot of effort is devoted to obtaining and publishing estimated reliabilities. Furthermore, take a look at the multivariate normal likelihood function on the formula sheet. Do we need the raw data? No, just $\widehat{\boldsymbol{\Sigma}}$. Use the corrected sample covariance matrix instead.
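% A possible lecture illustration of the algebra behind the reliability question. All parameter values are made up; the reliability of $D_j$ is taken as $r_j = \phi_{jj}/(\phi_{jj}+\omega_j)$, the proportion of the observed variance due to the factor.

```python
import numpy as np

# Made-up true parameter values, for illustration only
Phi = np.array([[2.0, 1.2],
                [1.2, 3.0]])          # covariance matrix of (F1, F2)
Omega = np.diag([0.5, 1.0])           # error variances omega_1, omega_2
Sigma = Phi + Omega                   # covariance of (D1, D2): only the variances change

# Reliabilities: proportion of each observed variance due to the factor
r1 = Phi[0, 0] / Sigma[0, 0]          # 0.8
r2 = Phi[1, 1] / Sigma[1, 1]          # 0.75

# Recover Phi from Sigma plus the reliabilities: shrink the variances,
# keep the covariance as-is.
Phi_hat = np.array([[r1 * Sigma[0, 0], Sigma[0, 1]],
                    [Sigma[0, 1], r2 * Sigma[1, 1]]])
print(np.allclose(Phi_hat, Phi))      # exact recovery at the true values
```

Replacing the true $\boldsymbol{\Sigma}$, $r_1$ and $r_2$ with consistent estimates gives a consistent estimator of $\boldsymbol{\Phi}$, by continuity.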