% 431s23Assignment10.tex Confirmatory factor analysis
\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage{comment}
\usepackage{alltt} % For colouring in verbatim-like environment.
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 431s23 Assignment Ten}}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/brunner/oldclass/431s23}
{\small\texttt{http://www.utstat.toronto.edu/brunner/oldclass/431s23}}}
\vspace{1 mm}
\end{center}

\noindent This assignment is for the quiz on Monday, April 10th (makeup day). For the quiz, please bring a printout of your full R input and output for Question~\ref{Rpoverty}. The other problems are not to be handed in. They are practice for the quiz.

\vspace{2mm}
\hrule
\begin{enumerate}
% There is a completely standardized version of this and following questions in 2015 A9.
\item Let
\begin{eqnarray*}
D_1 & = & \lambda_1 F_1 + e_1 \\
D_2 & = & \lambda_2 F_2 + e_2 \\
D_3 & = & \lambda_3 F_3 + e_3,
\end{eqnarray*}
where $F_1$, $F_2$, $F_3$, $e_1$, $e_2$ and $e_3$ are all independent with $F_j \sim N(0,1)$ and $e_j \sim N(0,\omega_j)$. All the expected values are zero. You can tell from the notation which variables are observable.
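Every covariance calculation in this assignment can be attacked with the same tool: the bilinearity of covariance. As a generic reminder (a standard identity, not the answer to any particular part), if $X = aU + e$ and $Y = bV + f$ with $a$ and $b$ constants, then
\begin{displaymath}
Cov(X,Y) = ab \, Cov(U,V) + a \, Cov(U,f) + b \, Cov(e,V) + Cov(e,f),
\end{displaymath}
and the independence assumptions make most of the terms vanish.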
\begin{enumerate}
\item Give the variance-covariance matrix of the observable variables.
\item Are the model parameters identifiable? Answer Yes or No and prove your answer.
\item Even though the parameters are not identifiable, the model itself is testable. That is, it implies a set of equality restrictions on the covariance matrix $\boldsymbol{\Sigma}$ that could be tested, and rejecting the null hypothesis would call the model into question. State the null hypothesis. Again, it is a statement about the $\sigma_{i,j}$ values.
\end{enumerate}

\item \label{f1v3} Here is another factor analysis model. This one has a single underlying factor.
\begin{eqnarray*}
D_1 & = & \lambda_1 F + e_1 \\
D_2 & = & \lambda_2 F + e_2 \\
D_3 & = & \lambda_3 F + e_3,
\end{eqnarray*}
where the factor and error terms are all independent, $F \sim N(0,1)$, $e_j \sim N(0,\omega_j)$, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are nonzero constants with $\lambda_1>0$.
\begin{enumerate}
\item Give the variance-covariance matrix of the observed variables.
\item Are the model parameters identifiable? Answer Yes or No and prove your answer. You are proving part of the 3-variable rule, so don't just cite it.
\end{enumerate}

\item \label{f1v4} Suppose we added another variable to the model of Question~\ref{f1v3}. That is, we add
\begin{displaymath}
D_4 = \lambda_4 F + e_4,
\end{displaymath}
with assumptions similar to the ones of Question~\ref{f1v3}. Now suppose that $\lambda_2=0$, while the other factor loadings are non-zero.
\begin{enumerate}
\item Is $\lambda_2$ identifiable? Justify your answer.
\item Are the other factor loadings identifiable? Justify your answer.
\end{enumerate}

\item \label{f1v5} Suppose we added a fifth variable to the model of Question~\ref{f1v4}. That is, we add
\begin{displaymath}
D_5 = \lambda_5 F + e_5,
\end{displaymath}
with assumptions similar to the ones of Question~\ref{f1v3}.
Now suppose that $\lambda_3=\lambda_4=0$, while the other factor loadings are non-zero.
\begin{enumerate}
\item Are $\lambda_3$ and $\lambda_4$ identifiable? Justify your answer.
\item Are the other three factor loadings identifiable? Justify your answer.
\item State the general pattern that is emerging here.
\end{enumerate}

\item \label{f2v6} We now extend the model of Question~\ref{f1v3} by adding a second factor. Let
\begin{eqnarray*}
D_1 & = & \lambda_1 F_1 + e_1 \\
D_2 & = & \lambda_2 F_1 + e_2 \\
D_3 & = & \lambda_3 F_1 + e_3 \\
D_4 & = & \lambda_4 F_2 + e_4 \\
D_5 & = & \lambda_5 F_2 + e_5 \\
D_6 & = & \lambda_6 F_2 + e_6,
\end{eqnarray*}
where all expected values are zero, $Var(e_i)=\omega_i$ for $i=1, \ldots, 6$, $Var(F_1)=Var(F_2)=1$, $Cov(F_1,F_2) = \phi_{12}$, the factors are independent of the error terms, and all the error terms are independent of each other. All the factor loadings are non-zero, and they might be positive or negative.
\begin{enumerate}
\item Give the covariance matrix of the observable variables. Show the necessary work. A lot of the work has already been done.
\item \label{iDenT} Are the model parameters identifiable? Answer Yes or No and prove your answer.
\item Write the model in matrix form as $\mathbf{d} = \boldsymbol{\Lambda} \mathbf{F} + \mathbf{e}$. That is, give the matrices. For example, $\mathbf{d}$ is $6 \times 1$.
\item Recall that a \emph{rotation} matrix is any square matrix $\mathbf{R}$ satisfying $\mathbf{RR}^\top = \mathbf{I}$. Give a specific $2 \times 2$ rotation matrix $\mathbf{R}$ so that $\boldsymbol{\Lambda}$ and $\boldsymbol{\Lambda}_2 = \boldsymbol{\Lambda}\mathbf{R}$ yield the same $\boldsymbol{\Sigma} = cov(\mathbf{d})$. Hint: Use your answer to Question~\ref{iDenT}.
\item Suppose we add the conditions $\lambda_1>0$ and $\lambda_4>0$. Are the parameters identifiable now?
\item In a goodness of fit test for this model, what are the degrees of freedom? % 6(6+1)/2 - 13 = 8.
\end{enumerate}

\item \label{f2v5} In Question~\ref{f2v6}, suppose we added just two variables along with the second factor. That is, we omit the equation for $D_6$, while keeping $\lambda_1>0$ and $\lambda_4>0$. Are the model parameters identifiable in this case? Answer Yes or No. Calculate or cite a rule.

\item \label{f3v9} Let's add a third factor to the model of Question~\ref{f2v6}. That is, we keep the equation for $D_6$ and add
\begin{eqnarray*}
D_7 & = & \lambda_7 F_3 + e_7 \\
D_8 & = & \lambda_8 F_3 + e_8 \\
D_9 & = & \lambda_9 F_3 + e_9
\end{eqnarray*}
with $\lambda_1>0$, $\lambda_4>0$, $\lambda_7>0$ and other assumptions similar to the ones we have been using. Are the model parameters identifiable? You don't have to do any calculations if you see the pattern.

\item \label{justone} In this factor analysis model, the observed variables are \emph{not} standardized, and the factor loading for $D_1$ is set equal to one. Let
\begin{eqnarray*}
D_1 & = & F + e_1 \\
D_2 & = & \lambda_2 F + e_2 \\
D_3 & = & \lambda_3 F + e_3,
\end{eqnarray*}
where $F \sim N(0,\phi)$, $e_1$, $e_2$ and $e_3$ are normal and independent of $F$ and each other with expected value zero, $Var(e_1)=\omega_1$, $Var(e_2)=\omega_2$, $Var(e_3)=\omega_3$, and $\lambda_2$ and $\lambda_3$ are nonzero constants.
\begin{enumerate}
\item Calculate the variance-covariance matrix of the observed variables.
\item Are the model parameters identifiable? Answer Yes or No and prove your answer. You are proving another part of the 3-variable rule, so don't just cite it.
\end{enumerate}

\item \label{two} We now extend the preceding model by adding another factor.
Let
\begin{eqnarray*}
D_1 & = & F_1 + e_1 \\
D_2 & = & \lambda_2 F_1 + e_2 \\
D_3 & = & \lambda_3 F_1 + e_3 \\
D_4 & = & F_2 + e_4 \\
D_5 & = & \lambda_5 F_2 + e_5 \\
D_6 & = & \lambda_6 F_2 + e_6,
\end{eqnarray*}
where all expected values are zero, $Var(e_i)=\omega_i$ for $i=1, \ldots, 6$,
\begin{displaymath}
\begin{array}{ccc} % Array of Arrays: Nice display of matrices.
cov\left( \begin{array}{c} F_1 \\ F_2 \end{array} \right) & = &
\left( \begin{array}{c c} \phi_{11} & \phi_{12} \\
                          \phi_{12} & \phi_{22} \end{array} \right),
\end{array}
\end{displaymath}
and $\lambda_2$, $\lambda_3$, $\lambda_5$ and $\lambda_6$ are nonzero constants.
\begin{enumerate}
\item Give the covariance matrix of the observable variables. Show the necessary work. A lot of the work has already been done in Question~\ref{justone}.
\item Are the model parameters identifiable? Answer Yes or No and prove your answer.
\end{enumerate}

\item Let's add a third factor to the model of Question~\ref{two}. That is, we add
\begin{eqnarray*}
D_7 & = & F_3 + e_7 \\
D_8 & = & \lambda_8 F_3 + e_8 \\
D_9 & = & \lambda_9 F_3 + e_9
\end{eqnarray*}
and
\begin{displaymath}
\begin{array}{ccc} % Nice display of matrices.
cov\left( \begin{array}{c} F_1 \\ F_2 \\ F_3 \end{array} \right) & = &
\left( \begin{array}{c c c} \phi_{11} & \phi_{12} & \phi_{13} \\
                            \phi_{12} & \phi_{22} & \phi_{23} \\
                            \phi_{13} & \phi_{23} & \phi_{33} \end{array} \right),
\end{array}
\end{displaymath}
with $\lambda_8\neq0$, $\lambda_9\neq0$ and so on. Are the model parameters identifiable? You don't have to do any calculations if you see the pattern.
%\newpage

\item This question leads to the two-variable, two-factor rule. Consider the following path diagram.
% 2-variable, 2-factor rule, original model.
\begin{center}
\includegraphics[width=3in]{TwoVar}
\end{center}
\begin{enumerate}
\item This is definitely a surrogate model. Give the equations of the original \emph{uncentered} model.
\item The $\phi_{12}$ in the path diagram is actually $\phi_{12}^\prime$. Express $\phi_{12}^\prime$ in terms of the parameters of the original model.
\item Give the covariance matrix for the surrogate model. Omit the primes from now on.
\item Assuming $\lambda_2$, $\lambda_4$ and $\phi_{12}$ are all non-zero, show that all the parameters are identifiable. This is the two-variable, two-factor rule, so don't just cite it.
\item Counting parameters and covariance structure equations, how many equality constraints on the covariance matrix should be implied by the model?
\item What is the equality constraint? Multiply through by denominators so that there are no fractions.
\item Would this equality constraint hold even with zero values for some of $\lambda_2$, $\lambda_4$ and $\phi_{12}$?
\end{enumerate}
% Actually, this is a pretty good question.

\item It's helpful to have a version of the Extra Variables Rule that does not depend on reference variables. That way, for example, we could freely add variables to the bi-factor model. Accordingly, let
\begin{eqnarray*}
\mathbf{d}_1 &=& \boldsymbol{\Lambda}_1 \mathbf{F} + \mathbf{e}_1 \\
\mathbf{d}_2 &=& \boldsymbol{\Lambda}_2 \mathbf{F} + \mathbf{e}_2 \\
\mathbf{d}_3 &=& \boldsymbol{\Lambda}_3 \mathbf{F} + \mathbf{e}_3
\end{eqnarray*}
where the random vectors $\mathbf{d}_1$ and $\mathbf{F}$ are $p \times 1$, and the $p \times p$ matrix $\boldsymbol{\Lambda}_1$ has an inverse, which it certainly will if the elements of $\mathbf{d}_1$ are reference variables. The factors are independent of the error terms. Suppose that $\mathbf{d}_1$ and $\mathbf{d}_2$ belong to a model whose parameters have already been identified somehow, and we want to add $\mathbf{d}_3$ to the model.
That is, $\boldsymbol{\Lambda}_1$, $\boldsymbol{\Lambda}_2$, $\boldsymbol{\Phi}$, $\boldsymbol{\Omega}_{11}$, $\boldsymbol{\Omega}_{12}$ and $\boldsymbol{\Omega}_{22}$ are identified, and we seek to identify $\boldsymbol{\Lambda}_3$, $\boldsymbol{\Omega}_{33}$ and $\boldsymbol{\Omega}_{23}$. It will be assumed that $\boldsymbol{\Omega}_{13} = \mathbf{O}$.
\begin{enumerate}
\item Write $\boldsymbol{\Sigma} = cov\left( \begin{array}{c} \mathbf{d}_1 \\ \hline \mathbf{d}_2 \\ \hline \mathbf{d}_3 \end{array} \right)$ as a partitioned matrix.
\item Show that $\boldsymbol{\Lambda}_3$, $\boldsymbol{\Omega}_{33}$ and $\boldsymbol{\Omega}_{23}$ are identifiable.
\end{enumerate}
% End of Extra Variables question

\item Suppose that the parameters of factor analysis models for two non-overlapping sets of observable variables are identifiable, and we want to combine the two models. Suppose there are $p_1$ factors in model one and $p_2$ factors in model two, and the models can be written as
\begin{eqnarray*}
\mathbf{d}_1 &=& \boldsymbol{\Lambda}_1\mathbf{F}_1 + \mathbf{e}_1 \\
\mathbf{d}_2 &=& \boldsymbol{\Lambda}_2\mathbf{F}_1 + \mathbf{e}_2 \\
\mathbf{d}_3 &=& \boldsymbol{\Lambda}_3\mathbf{F}_2 + \mathbf{e}_3 \\
\mathbf{d}_4 &=& \boldsymbol{\Lambda}_4\mathbf{F}_2 + \mathbf{e}_4,
\end{eqnarray*}
where $\mathbf{d}_1$ is $p_1 \times 1$, $\mathbf{d}_3$ is $p_2 \times 1$, and the square matrices $\boldsymbol{\Lambda}_1$ and $\boldsymbol{\Lambda}_3$ both have inverses. These conditions will definitely be satisfied if $\mathbf{d}_1$ contains reference variables for $\mathbf{F}_1$ and $\mathbf{d}_3$ contains reference variables for $\mathbf{F}_2$.
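The parts below, like the preceding Extra Variables question, lean on the matrix version of the usual covariance calculation. As a reminder (a generic identity, not itself the answer to any part), since the factors are independent of the error terms,
\begin{displaymath}
cov(\boldsymbol{\Lambda}_i \mathbf{F}_1 + \mathbf{e}_i, \boldsymbol{\Lambda}_j \mathbf{F}_2 + \mathbf{e}_j)
= \boldsymbol{\Lambda}_i \, cov(\mathbf{F}_1,\mathbf{F}_2) \, \boldsymbol{\Lambda}_j^\top + cov(\mathbf{e}_i,\mathbf{e}_j)
= \boldsymbol{\Lambda}_i \boldsymbol{\Phi}_{12} \boldsymbol{\Lambda}_j^\top + \boldsymbol{\Omega}_{ij}.
\end{displaymath}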
\begin{enumerate}
\item The parameter matrices of the combined model are $\boldsymbol{\Lambda}_1$, $\boldsymbol{\Lambda}_2$, $\boldsymbol{\Lambda}_3$, $\boldsymbol{\Lambda}_4$, $\boldsymbol{\Phi}_{11}$, $\boldsymbol{\Phi}_{12}$, $\boldsymbol{\Phi}_{22}$, $\boldsymbol{\Omega}_{11}$, $\boldsymbol{\Omega}_{12}$, $\boldsymbol{\Omega}_{13}$, $\boldsymbol{\Omega}_{14}$, $\boldsymbol{\Omega}_{22}$, $\boldsymbol{\Omega}_{23}$, $\boldsymbol{\Omega}_{24}$, $\boldsymbol{\Omega}_{33}$, $\boldsymbol{\Omega}_{34}$ and $\boldsymbol{\Omega}_{44}$ --- except let's set $\boldsymbol{\Omega}_{13} = cov(\mathbf{e}_1,\mathbf{e}_3) = \mathbf{O}$. Assuming $\boldsymbol{\Omega}_{13} = \mathbf{O}$, what parameters need to be identified for the combined model to be identifiable?
\item Show how $\boldsymbol{\Phi}_{12} = cov(\mathbf{F}_1,\mathbf{F}_2)$ can be identified.
\item Show how $\boldsymbol{\Omega}_{14} = cov(\mathbf{e}_1,\mathbf{e}_4)$ can be identified.
\item Show how $\boldsymbol{\Omega}_{23} = cov(\mathbf{e}_2,\mathbf{e}_3)$ can be identified.
\item Show how $\boldsymbol{\Omega}_{24} = cov(\mathbf{e}_2,\mathbf{e}_4)$ can be identified.
\end{enumerate}
% End of Extra Variables question
% Prove vector 3-variable rule.

\item All the models with identifiable parameters are surrogate models --- all of them. Consider the model of Question~\ref{f2v6}.
\begin{enumerate}
\item Write the equations of the \emph{original} uncentered model. You don't have to give additional specifications; just write the equations.
\newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item Noting that the equations of the centered original model look exactly like the ones in Question~\ref{f2v6}, show how the model of Question \ref{f2v6} arises from a re-parameterization of the centered original model by a change of variables --- actually, two changes of variables. Do it this way.
\begin{enumerate}
\item Re-write the model equations, showing what happens to the factor loadings.
\item Denote the variances and covariances of the factors under the original model by $\phi_{ij}$, and the variances and covariances under the surrogate model by $\phi^\prime_{ij}$. What is $\phi^\prime_{12}$ in terms of the parameters of the original model?
\item How are the $\omega_j$ affected by the re-parameterization?
\end{enumerate}
\item How much of $Var(D_2)$ is explained by $F_1$ under the original model?
\item How much of $Var(D_2)$ is explained by $F_1^\prime$ under the surrogate model? Make sure you put a prime on the parameter(s).
\item Did re-parameterization affect the explained variance?
\end{enumerate}
% End of surrogate model question
\newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\item \label{Rpoverty} The R part of this assignment is based on the Poverty Data. The data are given in the file
\href{http://www.utstat.toronto.edu/brunner/data/illegal/poverty.data.txt}
{\texttt{http://www.utstat.toronto.edu/brunner/data/illegal/poverty.data.txt}}.
This data set contains information from a sample of 97 countries. In order, the variables are Live birth rate per 1,000 of population, Death rate per 1,000 of population, Infant deaths per 1,000 of population under 1 year old, Life expectancy at birth for males, Life expectancy at birth for females, and Gross National Product per capita in U.S. dollars. There is also a variable with numeric values representing continent, and finally the name of the country. When you read the data, use the \texttt{na.strings = "."} option on \texttt{read.table}, so that the SAS missing value code, a period, will be treated as \texttt{NA}. The poverty data set can be very challenging and frustrating to work with, because correlated measurement errors produce negative variance estimates and other numerical problems almost everywhere you turn.
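As a starting point, here is a minimal sketch of the data setup and a skeleton \texttt{lavaan} fit. The URL and the \texttt{na.strings} option are as described above; the column names, object names, and the two-factor model syntax are assumptions (the file is assumed to have no header line), and the derived analysis variables are the four listed just below.

```r
# Sketch only: column and object names are assumptions, not required by the assignment.
poverty <- read.table(
  "http://www.utstat.toronto.edu/brunner/data/illegal/poverty.data.txt",
  na.strings = ".")  # Treat the SAS missing value code "." as NA
colnames(poverty) <- c("birthrate", "deathrate", "infmort", "lifexM",
                       "lifexF", "gnp", "contin", "country")
poverty$lifex <- (poverty$lifexM + poverty$lifexF) / 2  # average life expectancy
poverty$gnp1k <- poverty$gnp / 1000                     # GNP in thousands of dollars

# install.packages("lavaan")  # if lavaan is not already installed
library(lavaan)
twofactor <- 'Health =~ lifex + infmort
              Wealth =~ gnp1k + birthrate'
fit1 <- cfa(twofactor, data = poverty)
summary(fit1, standardized = TRUE)
```

By default, \texttt{cfa} uses the reference-variable re-parameterization, fixing the first loading for each factor to one; adding \texttt{std.lv = TRUE} gives the other standard re-parameterization, with factor variances fixed at one and all loadings free.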
To make your job easier (possible), please confine your analyses to the following four variables:
\begin{itemize}
\item Life Expectancy: Average of life expectancy for males and life expectancy for females.
\item Infant mortality rate.
\item Birth rate.
\item GNP/1000 = Gross national product in thousands of dollars. The re-scaling is a solution to numerical problems in fitting the model.
\end{itemize}
Here is a picture of a factor analysis model with two factors.
\begin{center}
\includegraphics[width=3in]{PovFactorPic} % Need \usepackage{graphicx}
\end{center}
The reason for making birth rate an indicator of wealth is that birth control costs money.
\begin{enumerate}
\item Fit the model with \texttt{lavaan}. Don't bootstrap. My value of $\widehat{\omega}_1$ is 1.432. You can't use the original model; you'll have to re-parameterize. Suppose we are interested in the correlation between Health and Wealth. Which of the two standard re-parameterizations should you choose?
\item Does this model fit the data adequately? Answer Yes or No, and back up your answer with two numbers from the printout: the value of a test statistic, and a $p$-value. % G^2 = 1.098, p = 0.295
\item Why does the goodness of fit test have one degree of freedom?
\item What is the maximum likelihood estimate of the correlation between factors? The answer is a single number from the printout. % 0.961
\item Give a 95\% confidence interval for the correlation between factors. Do it the easy way. What is funny about the confidence interval?
\item Now fit a model with the other common re-parameterization. Request standardized output when you apply \texttt{summary}.
\begin{enumerate}
\item Compare the two likelihood ratio tests for model fit. What do you see?
\item Compare the two $\boldsymbol{\Sigma(\widehat{\theta})}$ matrices (not part of summary; see lecture slides). What do you see?
\item Give the maximum likelihood estimate of $\lambda_2/\lambda_1$ for the \emph{original} model, based on output from the \emph{first} surrogate model you fit. Can you find this number in the output from the second model?
% -44.025/10.256 = -4.292609, compare lambda2hat = -4.293 from second model.
\item Based on the output from the second model, give the maximum likelihood estimate of the correlation between Health and Wealth. Can you find this number in the output from the first model?
% Std.lv Std.all both give 0.961.
% \item Estimate the proportion of variance in infant mortality rate that is explained by the Health factor. You might need a calculator, but it's quick if you do it the easy way.
% (-0.956)^2 = 0.913936 This is wrong. There's a covariance term.
\end{enumerate}

\item Finally, the high estimated correlation between factors suggests that there might be just one underlying factor: wealth. Try a single-factor model and see if it fits. Locate the relevant chi-squared statistic, degrees of freedom and $p$-value. Do the estimated factor loadings make sense? What do you conclude? Do you like the one-factor model or the two-factor model?
% Chi-squared = 2.823, df=2, p = 0.244 I like it.
\end{enumerate} % End of R question

\end{enumerate} % End of all the questions

% \vspace{20mm}
% \pagebreak
\vspace{3mm}
\noindent \textbf{Please bring a printout of your full R input and output for Question \ref{Rpoverty} to the quiz.}

\end{document}