\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 431s15 Assignment Nine}}\footnote{This assignment was prepared by
\href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences,
University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like
and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/431s15}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/431s15}}}
\vspace{1 mm}
\end{center}
\noindent The non-computer questions on this assignment are practice for the quiz, and will not be
handed in. Please bring your log files and your output files for the SAS part of this assignment
(Question~\ref{SASpoverty}) to the quiz. There may be one or more questions about them, and you may
be asked to hand printouts in with the quiz.
\begin{enumerate}
\item In the following model, all expected values are zero and all pairs of exogenous variables
have zero covariance.
\begin{eqnarray}
Y_1 &=& \gamma_1 X_1 +\gamma_2 X_2 + \epsilon_1 \nonumber \\
Y_2 &=& \beta Y_1 + \gamma_3 X_1 + \epsilon_2 \nonumber \\
W_1 &=& \lambda_1 X_1 + e_1 \nonumber \\
W_2 &=& \lambda_2 X_2 + e_2 \nonumber \\
V_1 &=& \lambda_3 Y_1 + e_3 \nonumber \\
V_2 &=& \lambda_4 Y_2 + e_4 \nonumber
\end{eqnarray}
Referring to the general two-stage structural equation model on the formula sheet, write the model
equations in matrix form. That is, put the symbols from the model above into the matrices. Also
give the matrices $\boldsymbol{\Phi}$, $\boldsymbol{\Psi}$ and $\boldsymbol{\Omega}$. The
dimensions must be right for the specific model above.
\item Consider the general factor analysis model
\begin{displaymath}
\mathbf{D}_i = \boldsymbol{\Lambda} \mathbf{F}_i + \mathbf{e}_i,
\end{displaymath}
where $\boldsymbol{\Lambda}$ is a $k\times p$ matrix of factor loadings, the $p\times 1$ vector of
factors $\mathbf{F}_i$ is multivariate normal with expected value zero and covariance matrix
$\boldsymbol{\Phi}$, and $\mathbf{e}_i$ is multivariate normal, independent of $\mathbf{F}_i$, with
expected value zero and covariance matrix $\boldsymbol{\Omega}$. All covariance matrices are
positive definite.
\begin{enumerate}
\item Calculate the matrix of covariances between the observable variables $\mathbf{D}_i$ and the
underlying factors $\mathbf{F}_i$.
\item Give the covariance matrix of $\mathbf{D}_i$. Show your work.
\item Any positive definite matrix can be written as $\mathbf{SS}^\prime$, where $\mathbf{S}$ is
the \emph{square root matrix}. Using the square root matrix of $\boldsymbol{\Phi}$, show that the
parameters of the general factor analysis model are not identifiable.
\item In an attempt to obtain a model whose parameters can be successfully estimated, let
$\boldsymbol{\Omega}$ be diagonal (errors are uncorrelated) and set $\boldsymbol{\Phi}$ to the
identity matrix (standardizing the factors). Show that the parameters of this revised model are
still not identifiable.
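\emph{Hint:} every covariance calculation in this question, and in the factor analysis questions
that follow, comes down to two matrix facts. If $\mathbf{X}$ and $\mathbf{Y}$ are random vectors
with expected value zero and $\mathbf{A}$ and $\mathbf{B}$ are constant matrices of compatible
dimensions, then writing $C(\mathbf{X},\mathbf{Y}) = E(\mathbf{X}\mathbf{Y}^\prime)$ for the matrix
of covariances,
\begin{displaymath}
V(\mathbf{AX}) = \mathbf{A}V(\mathbf{X})\mathbf{A}^\prime
\hspace{10mm} \mbox{and} \hspace{10mm}
C(\mathbf{AX},\mathbf{BY}) = \mathbf{A}\,C(\mathbf{X},\mathbf{Y})\,\mathbf{B}^\prime.
\end{displaymath}
For example, since $\mathbf{F}_i$ and $\mathbf{e}_i$ are independent,
$C(\mathbf{D}_i,\mathbf{F}_i) = C(\boldsymbol{\Lambda}\mathbf{F}_i + \mathbf{e}_i, \, \mathbf{F}_i)
= \boldsymbol{\Lambda}\boldsymbol{\Phi}$.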
\end{enumerate}
\newpage
\item Here is a factor analysis model in which all the observed variables are \emph{standardized}.
That is, they are divided by their standard deviations as well as having the means subtracted off.
This gives them mean zero and variance one. Therefore, we work with a correlation matrix rather
than a covariance matrix; that's the classical way to do factor analysis. Let
\begin{eqnarray*}
Z_1 & = & \lambda_1 F_1 + e_1 \\
Z_2 & = & \lambda_2 F_2 + e_2 \\
Z_3 & = & \lambda_3 F_3 + e_3,
\end{eqnarray*}
where $F_1$, $F_2$ and $F_3$ are independent $N(0,1)$, $e_1$, $e_2$ and $e_3$ are normal and
independent of each other and of $F_1$, $F_2$ and $F_3$, $V(Z_1)=V(Z_2)=V(Z_3)=1$, and $\lambda_1$,
$\lambda_2$ and $\lambda_3$ are nonzero constants. The expected values of all random variables
equal zero.
\begin{enumerate}
\item What is $V(e_1)$? $V(e_2)$? $V(e_3)$?
\item What is $Corr(F_1,Z_1)$?
\item Give the communality of $Z_j$. Recall that the communality is the proportion of variance
explained by the common factor(s). That is, it is the proportion of $V(Z_j)$ that does not come
from $e_j$.
\item Give the variance-covariance matrix (correlation matrix) of the observed variables.
\item Are the model parameters identifiable? Answer Yes or No and prove your answer.
\item Even though the parameters are not identifiable, the model itself is testable. That is, it
implies a set of equality restrictions on the correlation matrix $\boldsymbol{\Sigma}$ that could
be tested, and rejecting the null hypothesis would call the model into question. State the null
hypothesis. Again, it is a statement about the $\sigma_{i,j}$ values.
\end{enumerate}
\item \label{f1v3} Here is another factor analysis model. This one has a single underlying factor.
Again, all the observed variables are standardized.
\begin{eqnarray*}
Z_1 & = & \lambda_1 F + e_1 \\
Z_2 & = & \lambda_2 F + e_2 \\
Z_3 & = & \lambda_3 F + e_3,
\end{eqnarray*}
where $F \sim N(0,1)$, $e_1$, $e_2$ and $e_3$ are normal and independent of $F$ and each other with
expected value zero, $V(Z_1)=V(Z_2)=V(Z_3)=1$, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are
nonzero constants with $\lambda_1>0$.
\begin{enumerate}
\item What is $V(e_1)$? $V(e_2)$? $V(e_3)$?
\item Give the communality of $Z_j$.
\item Write the reliability of $Z_j$ as a measure of $F$. Recall that the reliability is defined as
the squared correlation of the true score with the observed score.
% Looked into the reliability of the sum but it's a mess.
\item Give the variance-covariance (correlation) matrix of the observed variables.
\item Are the model parameters identifiable? Answer Yes or No and prove your answer.
\end{enumerate}
\item \label{f1v4} Suppose we added another variable to the model of Question~\ref{f1v3}. That is,
we add
\begin{displaymath}
Z_4 = \lambda_4 F + e_4,
\end{displaymath}
with assumptions similar to the ones of Question~\ref{f1v3}. Now suppose that $\lambda_2=0$.
\begin{enumerate}
\item Is $\lambda_2$ identifiable? Justify your answer.
\item Are the other factor loadings identifiable? Justify your answer.
\item State the general pattern that is emerging here.
\end{enumerate}
\item \label{f1v5} Suppose we added a fifth variable to the model of Question~\ref{f1v4}. That is,
we add
\begin{displaymath}
Z_5 = \lambda_5 F + e_5,
\end{displaymath}
with assumptions similar to the ones of Question~\ref{f1v3}. Now suppose that
$\lambda_3=\lambda_4=0$.
\begin{enumerate}
\item Are $\lambda_3$ and $\lambda_4$ identifiable? Justify your answer.
\item Are the other three factor loadings identifiable? Justify your answer.
\item State the general pattern that is emerging here.
\end{enumerate}
\item \label{f2v6} We now extend the model of Question~\ref{f1v3} by adding a second factor. Let
\begin{eqnarray*}
Z_1 & = & \lambda_1 F_1 + e_1 \\
Z_2 & = & \lambda_2 F_1 + e_2 \\
Z_3 & = & \lambda_3 F_1 + e_3 \\
Z_4 & = & \lambda_4 F_2 + e_4 \\
Z_5 & = & \lambda_5 F_2 + e_5 \\
Z_6 & = & \lambda_6 F_2 + e_6,
\end{eqnarray*}
where all expected values are zero, $V(e_i)=\omega_i$ for $i=1, \ldots, 6$, $V(F_1)=V(F_2)=1$,
$Cov(F_1,F_2) = \phi_{12}$, the factors are independent of the error terms, and all the error
terms are independent of each other. All the factor loadings are non-zero.
\begin{enumerate}
\item Give the covariance matrix of the observable variables. Show the necessary work. A lot of the
work has already been done.
\item \label{iDenT} Are the model parameters identifiable? Answer Yes or No and prove your answer.
\item Write the model in matrix form as $\mathbf{Z} = \boldsymbol{\Lambda} \mathbf{F} + \mathbf{e}$.
That is, give the matrices. For example, $\mathbf{Z}$ is $6 \times 1$.
\item Recall that a \emph{rotation} matrix is any square matrix $\mathbf{R}$ satisfying
$\mathbf{RR}^\prime = \mathbf{I}$. Give a specific $2 \times 2$ rotation matrix $\mathbf{R}$ so
that $\boldsymbol{\Lambda}$ and $\boldsymbol{\Lambda}_2 = \boldsymbol{\Lambda}\mathbf{R}$ yield the
same $\boldsymbol{\Sigma} = V(\mathbf{Z})$ for this model. This is another way of expressing your
answer to Question~\ref{iDenT}.
\item Suppose we add the conditions $\lambda_1>0$ and $\lambda_4>0$. Are the parameters
identifiable now?
\end{enumerate}
\item \label{f2v5} In Question~\ref{f2v6}, suppose we added just two variables along with the
second factor. That is, we omit the equation for $Z_6$, while keeping $\lambda_1>0$ and
$\lambda_4>0$. Are the model parameters identifiable in this case? Answer Yes or No; show your
work.
\item \label{f3v9} Let's add a third factor to the model of Question~\ref{f2v6}. That is, we keep
the equation for $Z_6$ and add
\begin{eqnarray*}
Z_7 & = & \lambda_7 F_3 + e_7 \\
Z_8 & = & \lambda_8 F_3 + e_8 \\
Z_9 & = & \lambda_9 F_3 + e_9
\end{eqnarray*}
with $\lambda_1>0$, $\lambda_4>0$, $\lambda_7>0$ and other assumptions similar to the ones we have
been using. Are the model parameters identifiable? You don't have to do any calculations if you see
the pattern.
\item \label{justone} In this factor analysis model, the observed variables are \emph{not}
standardized, and the factor loading for $D_1$ is set equal to one. Let
\begin{eqnarray*}
D_1 & = & F + e_1 \\
D_2 & = & \lambda_2 F + e_2 \\
D_3 & = & \lambda_3 F + e_3,
\end{eqnarray*}
where $F \sim N(0,\phi)$, $e_1$, $e_2$ and $e_3$ are normal and independent of $F$ and each other
with expected value zero, $V(e_1)=\omega_1$, $V(e_2)=\omega_2$, $V(e_3)=\omega_3$, and $\lambda_2$
and $\lambda_3$ are nonzero constants.
\begin{enumerate}
\item Calculate the variance-covariance matrix of the observed variables.
\item Are the model parameters identifiable? Answer Yes or No and prove your answer.
\end{enumerate}
\item \label{two} We now extend the preceding model by adding another factor.
Let
\begin{eqnarray*}
D_1 & = & F_1 + e_1 \\
D_2 & = & \lambda_2 F_1 + e_2 \\
D_3 & = & \lambda_3 F_1 + e_3 \\
D_4 & = & F_2 + e_4 \\
D_5 & = & \lambda_5 F_2 + e_5 \\
D_6 & = & \lambda_6 F_2 + e_6,
\end{eqnarray*}
where all expected values are zero, $V(e_i)=\omega_i$ for $i=1, \ldots, 6$,
\begin{displaymath}
\begin{array}{ccc} % Array of Arrays: Nice display of matrices.
V\left( \begin{array}{c} F_1 \\ F_2 \end{array} \right) & = &
\left( \begin{array}{c c} \phi_{11} & \phi_{12} \\
                          \phi_{12} & \phi_{22} \end{array} \right),
\end{array}
\end{displaymath}
and $\lambda_2,\lambda_3, \lambda_5$ and $\lambda_6$ are nonzero constants.
\begin{enumerate}
\item Give the covariance matrix of the observable variables. Show the necessary work. A lot of the
work has already been done in Question~\ref{justone}.
\item Are the model parameters identifiable? Answer Yes or No and prove your answer.
\end{enumerate}
\item Let's add a third factor to the model of Question~\ref{two}. That is, we add
\begin{eqnarray*}
D_7 & = & F_3 + e_7 \\
D_8 & = & \lambda_8 F_3 + e_8 \\
D_9 & = & \lambda_9 F_3 + e_9
\end{eqnarray*}
and
\begin{displaymath}
\begin{array}{ccc} % Nice display of matrices.
V\left( \begin{array}{c} F_1 \\ F_2 \\ F_3 \end{array} \right) & = &
\left( \begin{array}{c c c} \phi_{11} & \phi_{12} & \phi_{13} \\
                            \phi_{12} & \phi_{22} & \phi_{23} \\
                            \phi_{13} & \phi_{23} & \phi_{33} \end{array} \right),
\end{array}
\end{displaymath}
with $\lambda_8\neq0$, $\lambda_9\neq0$ and so on. Are the model parameters identifiable? You don't
have to do any calculations if you see the pattern.
%\newpage
\item This question outlines a different approach to identifying the parameters of a measurement
model. Recall that when latent variables are measured with error and the error terms are not
correlated, only the variances of $\boldsymbol{\Sigma}$ are affected by measurement error. The
covariances are untouched.
\begin{enumerate}
\item Just to remind yourself of this fact, calculate $\boldsymbol{\Sigma}$ for the following
model. Independently for $i=1, \ldots, n$, let
% Need eqnarray inside a parbox to make it the cell of a table
\begin{tabular}{ccc}
\parbox[c]{1.5in} {
\begin{eqnarray*}
D_{i,1} &=& F_{i,1} + e_{i,1} \\
D_{i,2} &=& F_{i,2} + e_{i,2} \\ &&
\end{eqnarray*}
} % End parbox
&
$V\left( \begin{array}{c} F_{i,1} \\ F_{i,2} \end{array} \right) =
 \left( \begin{array}{c c} \phi_{11} & \phi_{12} \\
                           \phi_{12} & \phi_{22} \end{array} \right)$
&
$V\left( \begin{array}{c} e_{i,1} \\ e_{i,2} \end{array} \right) =
 \left( \begin{array}{c c} \omega_1 & 0 \\
                           0 & \omega_2 \end{array} \right)$
\end{tabular}
\item Based on a random sample of $(D_1,D_2)$ pairs, do we have $\widehat{\boldsymbol{\Sigma}}
\stackrel{a.s.}{\rightarrow} \boldsymbol{\Phi}$? Answer Yes or No and briefly justify your answer.
\item Denote the reliability of $D_{i,1}$ as a measure of $F_{i,1}$ by $r_1$, and denote the
reliability of $D_{i,2}$ as a measure of $F_{i,2}$ by $r_2$. Suppose you have good (consistent)
estimates of $r_1$ and $r_2$ from another source; say $\widehat{r}_1 \stackrel{a.s.}{\rightarrow}
r_1$ and $\widehat{r}_2 \stackrel{a.s.}{\rightarrow} r_2$. Give a consistent estimator of
$\boldsymbol{\Phi}$. Show your work.
\end{enumerate}
The point of this question is that sometimes you can use ``auxiliary'' (outside) information to
identify the parameters of a measurement model and rescue the analysis of data that were not
collected with latent variable modelling in mind.
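For illustration only (a minimal sketch, not part of the assignment), here is one way such a
reliability correction might be carried out in SAS with \texttt{proc iml}. The matrix
\texttt{SigmaHat}, the reliabilities \texttt{r1} and \texttt{r2}, and all the numbers below are
made up for the example.
\begin{verbatim}
proc iml;
   /* Sample covariance matrix of (D1, D2): made-up numbers */
   SigmaHat = {10 4,
                4 8};
   r1 = 0.70;    /* Estimated reliability of D1, from another source */
   r2 = 0.80;    /* Estimated reliability of D2, from another source */
   /* Uncorrelated measurement error inflates only the diagonal, so
      the off-diagonal elements of SigmaHat are left alone.          */
   PhiHat = SigmaHat;
   PhiHat[1,1] = r1 * SigmaHat[1,1];
   PhiHat[2,2] = r2 * SigmaHat[2,2];
   print PhiHat;
quit;
\end{verbatim}
The same idea extends to any number of observed variables, with one reliability estimate per
variable.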
This approach is especially promising in Psychology, where a lot of effort is devoted to obtaining
and publishing estimated reliabilities. Furthermore, take a look at the multivariate normal
likelihood function on the formula sheet. Do we need the raw data? No, just
$\widehat{\boldsymbol{\Sigma}}$. Use the corrected sample covariance matrix instead.
\newpage
\item \label{SASpoverty} The SAS part of this assignment is based on the Poverty Data. The data are
given in the file
\href{http://www.utstat.toronto.edu/~brunner/data/illegal/poverty.data.txt}
{\texttt{poverty.data.txt}}. There is a link on the course web page in case the one in this
document does not work. This data set contains information from a sample of 97 countries. In
order, the variables are Live birth rate per 1,000 of population, Death rate per 1,000 of
population, Infant deaths per 1,000 of population under 1 year old, Life expectancy at birth for
males, Life expectancy at birth for females, and Gross National Product per capita in U.S.
dollars. There is also a categorical variable representing location (continent), and finally the
name of the country. This can be a very challenging and frustrating data set to work with, because
correlated measurement errors produce negative variance estimates and other numerical problems
almost everywhere you turn. To make your job easier, please confine your analyses to the following
four variables:
\begin{itemize}
\item[] Life Expectancy: Average of life expectancy for males and life expectancy for females.
\item[] Infant mortality rate.
\item[] Birth rate.
\item[] GNP/1000 = Gross national product in thousands of dollars.
\end{itemize}
You are only using four of the variables in the data file, but you should read them all, because
other ways of skipping variables are more trouble. The names of character-valued variables (the
last two) must be followed by dollar signs (\$). Here is a picture of a factor analysis model with
two factors.
\begin{center}
\includegraphics[width=3in]{A12Pic3} % Need \usepackage{graphicx}
\end{center}
The reason for making birth rate an indicator of wealth is that birth control costs money.
\begin{enumerate}
\item Fit the model with \texttt{proc calis}. Make sure to include the \texttt{pcorr} option, so
you will get $\boldsymbol{\Sigma}(\widehat{\boldsymbol{\theta}})$. You will have to
re-parameterize. Suppose we are interested in the correlation between Health and Wealth; which of
the two standard re-parameterizations should you choose? When I did this, I did \emph{not}
standardize the observable variables.
\item What is the parameter vector $\boldsymbol{\theta}$ for this model? \emph{Give your answer in
the form of a list of names from your SAS job.}
\item What rule tells you that the re-parameterization you have chosen results in a parameter
vector that is identifiable (at least, it's identifiable in most of the parameter space)? Name the
rule given in Lecture 21.
\item Does this model fit the data adequately? Answer Yes or No, and back up your answer with two
numbers from the printout: the value of a test statistic, and a $p$-value.
\item What is the maximum likelihood estimate of the correlation between factors? The answer is a
single number from the printout.
% 0.96065
\item Now fit a model with the other common re-parameterization, again including the \texttt{pcorr}
option.
\begin{enumerate}
\item Compare the two likelihood ratio tests for model fit. What do you see?
\item Compare the two $\boldsymbol{\Sigma}(\widehat{\boldsymbol{\theta}})$ matrices. What do you
see?
\item Give the maximum likelihood estimate of $\lambda_2/\lambda_1$ based on output from the first
model. Can you find this number in the output from the second model?
% 44.02500/(-10.25556) = -4.292793, compare -4.29281 from second model.
\item Based on the output from the second model, give the maximum likelihood estimate of the
correlation between Health and Wealth. Can you find this number in the output from the first
model?
% 53.73277 / sqrt(105.17622*29.74605) = 0.9606514, compare 0.96065 from first model.
\end{enumerate}
\item Finally, the high estimated correlation between factors from the first part of this question
suggests that there might be just one underlying factor: wealth. Try a single-factor model and see
if it fits. Locate the relevant chi-squared statistic, degrees of freedom and $p$-value. Do the
estimated factor loadings make sense?
% Chi-squared = 2.7922 without vardef=n from proc calis, compared to 2.7922187 from proc factor
% "without Bartlett's correction." But I like vardef=n, so my answer is 2.8232.
\end{enumerate}
\end{enumerate}
\vspace{60mm}
\noindent Bring your log and your output files to the quiz. You may be asked for numbers from your
printouts, and you may be asked to hand them in. \textbf{There must be no error messages, and no
notes or warnings about invalid data in your log file.}
\end{document}