% 431s23Assignment9.tex     Principal components and exploratory factor analysis
\documentclass[11pt]{article} 
%\usepackage{amsbsy} % for \boldsymbol and \pmb 
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage{comment}
\usepackage{alltt} % For colouring in verbatim-like environment.
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers


\begin{document}

%\enlargethispage*{1000 pt} 


\begin{center}   
{\Large \textbf{STA 431s23 Assignment Nine}}\footnote{This assignment was prepared by  \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner},
Department of Statistical Sciences, University of Toronto. It is licensed under a 
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
     {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/brunner/oldclass/431s23} {\small\texttt{http://www.utstat.toronto.edu/brunner/oldclass/431s23}}}
\vspace{1 mm}
\end{center}

\noindent
\emph{For the Quiz on Friday March 31st, please bring a printout of your full R input and output for Question~\ref{Rstatclass}. The other problems are not to be handed in. They are practice for the Quiz.}
\vspace{2mm}
\hrule

\begin{enumerate} 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Principal Components %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\item Let $\mathbf{z} \sim N_k(\mathbf{0}, \boldsymbol{\Sigma})$, where the elements of $\mathbf{z}$ are standardized, so that $\boldsymbol{\Sigma} = \mathbf{CDC}^\top$ is a correlation matrix. 
    \begin{enumerate}
        \item What is the distribution of $\mathbf{y} = \mathbf{C}^\top \mathbf{z}$? (The elements of $\mathbf{y}$ are the \emph{principal components} of $\mathbf{z}$.)
        \item What is the variance of the scalar random variable $y_j$, that is, element $j$ of $\mathbf{y}$?
        \item How do you know that the elements of $\mathbf{y}$ are independent?
        \item Write $\mathbf{z}$ as a function of $\mathbf{y}$.
        \item Using the notation $\mathbf{C} = [c_{ij}]$, 
                \begin{enumerate}
                    \item Write the scalar random variable $z_1$ as a function of $y_1, \ldots, y_k$.
                    \item What is $Var(z_1)$?
                    \item \label{explvar} What proportion of $Var(z_1)$ is ``explained" by $y_4$?
                \end{enumerate}



        \item Calculate $cov(\mathbf{z}, \mathbf{y})$. Simplify.
        \item What is element $i,j$ of the matrix $cov(\mathbf{z}, \mathbf{y})$? Give your answer in terms of the $c_{ij}$ and $\lambda_j$.
        \item \label{sqcorr} What is the squared correlation between $z_1$ and $y_4$? Compare this to your answer to Question~\ref{explvar}.
        \item To answer that last question, you needed to standardize the principal component $y_4$. The whole vector of principal components can be standardized with $\mathbf{y}_2 = \mathbf{D}^{-\frac{1}{2}} \mathbf{y}$. Verify that the standardization works by calculating $cov(\mathbf{y}_2)$. 
        \item Write $\mathbf{z}$ as a function of $\mathbf{y}_2$.
        \item Calculate $cov(\mathbf{z}, \mathbf{y}_2)$. Because $\mathbf{z}$ and $\mathbf{y}_2$ are both standardized, this is a matrix of correlations.
        \item Question~\ref{sqcorr} tells us that if you were to square the elements in column $j$ of $cov(\mathbf{z}, \mathbf{y}_2)$ and add them up, you'd get the total amount of variance in all the $z_i$ variables that is explained by principal component $j$. Calculate all these quantities at once with a matrix operation. What you want are the diagonal elements of a certain matrix product. % Answer is D.
    \end{enumerate}
 

\pagebreak %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\item We usually don't retain all $k$ principal components. Instead, we summarize the variables with a smaller set of $p$ principal components that explain a good part of the total variance. Typically, components associated with eigenvalues greater than one are retained. This may be accomplished with a $p \times k$ \emph{selection matrix} that will be denoted by $\mathbf{J}$. Each row of $\mathbf{J}$ has a one in the position of a component to be retained, and the rest zeros. For example, if there were five principal components, the first two would be selected as follows.
\begin{equation*}
    \mathbf{Jy} =
    \left(\begin{array}{ccccc}
    1 & 0 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 & 0 
    \end{array}\right) 
    \left(\begin{array}{c}
    y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5
    \end{array}\right) = 
    \left(\begin{array}{c} y_1 \\ y_2  \end{array}\right).
\end{equation*}
If $\mathbf{A}$ is any $k \times k$ matrix, then $\mathbf{JAJ}^\top$ is the $p \times p$ sub-matrix with rows and columns indicated by $\mathbf{J}$. A sub-matrix of the identity is another (smaller) identity matrix, so $\mathbf{JJ}^\top = \mathbf{I}_p$. Selection matrices are quite flexible and can even be used to re-order variables, but here they will just be used to select the first $p$ principal components.

So, let $\mathbf{J}$ be the matrix that selects the first $p$ elements of a vector. Let $\mathbf{f} = \mathbf{Jy}_2$. That's the first $p$ standardized principal components.

    \begin{enumerate}
        \item Show $cov(\mathbf{f}) = \mathbf{I}_p$.
        \item Let $\mathbf{L} = cov(\mathbf{z}, \mathbf{f})$. Calculate $\mathbf{L}$. This is the matrix of correlations between the $z$ variables and the first $p$ principal components. That is, it should be the first $p$ columns of $cov(\mathbf{z}, \mathbf{y}_2) = \mathbf{CD}^\frac{1}{2}$. Indeed, post-multiplication by $\mathbf{J}^\top$ selects the first $p$ columns.  
        \item Let $\mathbf{f}^\prime = \mathbf{Rf}$, where $\mathbf{R}$ is an orthogonal (rotation) matrix.
                \begin{enumerate}
                    \item Calculate $cov(\mathbf{f}^\prime)$. 
                    \item Calculate $cov(\mathbf{z}, \mathbf{f}^\prime)$. Leave your answer in terms of $\mathbf{L}$. This is very quick. Why is $cov(\mathbf{z}, \mathbf{f}^\prime)$ a matrix of correlations, rather than just covariances?
                    \item If you square the correlations in row $j$ of $\mathbf{L}$ and add them up, you get the proportion of variance in $z_j$ that is explained by the first $p$ principal components. Show that this quantity is not affected by rotation of the components. Hint: calculate all the proportions of explained variation at once with a matrix operation. What you want are the diagonal elements of a certain matrix product. If you do the operation on $cov(\mathbf{z}, \mathbf{f})$ and $cov(\mathbf{z}, \mathbf{f}^\prime)$, you get the same answer.
                \end{enumerate}
    \end{enumerate}


\pagebreak %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\item The following is based on data from one of my classes.

\begin{alltt}
{\color{blue}> pc3 = prcomp(dat, scale = T, rank=3)
> L = cor(dat,pc3$x) # Correlations of variables with components
> round(L,3) }
          PC1    PC2    PC3
Quiz 1  0.546 -0.396  0.377
Quiz 2  0.657 -0.156  0.237
Quiz 3  0.646 -0.287  0.133
Quiz 4  0.614  0.225  0.446
Quiz 5  0.606  0.417 -0.185
Quiz 6  0.511  0.616  0.106
Quiz 7  0.589  0.390 -0.257
Quiz 8  0.539 -0.239 -0.582
Quiz 9  0.730 -0.052 -0.319
Quiz 10 0.318 -0.351 -0.337
Final   0.763 -0.201  0.186

{\color{blue}> vm3 = varimax(L); Rt = vm3$rotmat
> L2 = L %*% Rt # Also, L2 = vm3$loadings
> round(L2,3) }
         [,1]   [,2]   [,3]
Quiz 1  0.763 -0.050 -0.115
Quiz 2  0.658  0.225 -0.168
Quiz 3  0.641  0.127 -0.303
Quiz 4  0.602  0.481  0.181
Quiz 5  0.138  0.711 -0.228
Quiz 6  0.168  0.779  0.130
Quiz 7  0.093  0.688 -0.289
Quiz 8  0.115  0.193 -0.798
Quiz 9  0.330  0.422 -0.593
Quiz 10 0.156 -0.056 -0.557
Final   0.718  0.257 -0.274
\end{alltt}

You will need a calculator for these questions. Please round your answers to three decimal places.
    \begin{enumerate}
        \item What proportion of the variance in Quiz 9 score is explained by the first \emph{unrotated} principal component? % 0.730^2 = 0.5329
        \item What proportion of the variance in Quiz 9 score is explained by the first \emph{rotated} principal component? % 0.330^2 = 0.1089
        \item What proportion of the variance in Final Exam score is explained by the unrotated principal components? % 0.763^2 + 0.201^2 + 0.186^2 = 0.657
        \item What proportion of the variance in Final Exam score is explained by the rotated principal components? % 0.718^2 + 0.257^2 + 0.274^2 = 0.657
    \end{enumerate}
        


\pagebreak 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%  Factor Analysis  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\item \label{scalarEFA} Independently for $i = 1, \ldots, n$, let
\begin{eqnarray*}
    z_{i,1} &=& \lambda_{11} F_{i,1} +  \lambda_{12} F_{i,2} + e_{i,1} \\
    z_{i,2} &=& \lambda_{21} F_{i,1} +  \lambda_{22} F_{i,2} + e_{i,2} \\
    z_{i,3} &=& \lambda_{31} F_{i,1} +  \lambda_{32} F_{i,2} + e_{i,3} \\
    z_{i,4} &=& \lambda_{41} F_{i,1} +  \lambda_{42} F_{i,2} + e_{i,4} \\
    z_{i,5} &=& \lambda_{51} F_{i,1} +  \lambda_{52} F_{i,2} + e_{i,5},
\end{eqnarray*} 
where all expected values are zero, $Var(F_{i,1}) = Var(F_{i,2}) = 1$, and all the $F_{i,j}$ and $e_{i,j}$ are independent. As the notation suggests, the $z_{i,j}$ are standardized, so that $Var(z_{i,j}) = 1$ for all $i$ and $j$. Only the $z_{i,j}, $ are observable. 

Please give Greek letter answers to the following. Be able to show your work if necessary.
    \begin{enumerate}
        \item What is $Var(e_{i,2})$?
        \item What is the uniqueness of $z_{i,2}$?
        \item What is the communality of $z_{i,2}$?
        \item What is $Corr(z_{i,3}, F_{i,2})$?
        \item What is the reliability of $z_{i,3}$ as a measurement of $F_{i,2}$?
        \item What is the reliability of $s_i = z_{i,1} + z_{i,2} + z_{i,3} + z_{i,4} + z_{i,5}$ as a measurement of $F_{i,1}$? You can see that this is general. 
        \item What proportion of the variance in $z_{i,4}$ is explained by the common factors?
        \item What proportion of the variance in the observable variables is explained by Factor One? You are being asked for a proportion, so the answer is between zero and one.
        \item What is $Cov(z_{i,2}, z_{i,5})$?
        \item What are the parameters of this model?
        \item All the factor loadings are correlations, so you might think that the parameter space is  a hyper-cube, running from $-1$ to $+1$ in eight dimensions. However \ldots, what inequality constraint must $\lambda_{21}$ and $\lambda_{22}$ obey? 
        \item What does this look like geometrically, in just the $(\lambda_{21},\lambda_{22})$ plane?
% So the parameter space might be the intersection of a collection of cylinders. There are some more constraints I think, because the whole Sigma matrix has to be at lest non-negative definite. Maybe the whole thing is a hyper-sphere, but I'm not sure. 
        \item Does this model pass the test of the Parameter Count Rule? Answer Yes or No and give the numbers. Remember, $\boldsymbol{\Sigma}$ is a correlation matrix, so there are no equations corresponding to the diagonal elements.
    \end{enumerate}

\pagebreak %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\item \label{matrixEFA} Consider the general factor analysis model     
        \begin{displaymath} 
            \mathbf{d}_i = \boldsymbol{\Lambda} \mathbf{F}_i + \mathbf{e}_i,
        \end{displaymath}

where $\boldsymbol{\Lambda}$ is a $k\times p$ matrix of factor loadings, the vector of factors $\mathbf{F}_i$ is a $p\times 1$ multivariate normal with expected value zero and covariance matrix $\boldsymbol{\Phi}$, and $\mathbf{e}_i$ is multivariate normal and independent of $\mathbf{F}_i$, with expected value zero and covariance matrix $\boldsymbol{\Omega}$. All covariance matrices are positive definite. 
        \begin{enumerate}
            \item Calculate the matrix of covariances between the observable variables $\mathbf{d}_i$ and the underlying factors $\mathbf{F}_i$.
            \item Give the covariance matrix of $\mathbf{d}_i$. 
            \item Because $\boldsymbol{\Phi}$ symmetric and positive definite, it has a square root matrix that is also symmetric. Using this, show that the parameters of the general factor analysis model are not identifiable.
            \item \label{standfact} In an attempt to obtain a model whose parameters can be successfully estimated, let $\boldsymbol{\Omega}$ be diagonal (errors are uncorrelated) and set $\boldsymbol{\Phi}$ to the identity matrix (standardizing the factors). Show that the parameters of this revised model are still not identifiable. Hint: An orthogonal matrix $\mathbf{R}$ (corresponding to an orthogonal rotation) is one satisfying $\mathbf{RR}^\top=\mathbf{I}$.

\item As in Question~\ref{standfact}, suppose that $\boldsymbol{\Phi}$ is set to the identity matrix, standardizing the factors as well as making them uncorrelated. In addition, standardize the observable data to obtain $\mathbf{z}_i$. Write
\begin{eqnarray*}
    \mathbf{z}_i & = & \boldsymbol{\Lambda} \mathbf{F}_i + \mathbf{e}_i \\
                 & = & \boldsymbol{\Lambda} \mathbf{R}^\top \mathbf{R} 
                        \mathbf{F}_i + \mathbf{e}_i \\ 
                 & = & \boldsymbol{\Lambda}_2 \mathbf{F}^\prime_i + \mathbf{e}_i \\
\end{eqnarray*}

        \begin{enumerate}
            \item Calculate $cov(\mathbf{z}_i, \mathbf{F}_i)$ and $cov(\mathbf{z}_i, \mathbf{F}_i^\prime)$. These are matrices of correlations.
            \item Show that the communalities of the variables are not affected by rotation. You want the diagonal of a certain matrix product. 
        \end{enumerate}
    \end{enumerate} % End matrix EFA question, I think

\pagebreak %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\item \label{Rstatclass} The \texttt{statclass} data include marks on quizzes, computer assignments, a midterm test and the final exam. The column labelled \texttt{S} is sex, and the column labelled \texttt{E} is ethnic background. We will not use \texttt{S} or \texttt{E} on this assignment. They are both just guesses by the prof, and are are likely measured with error. We are just going to do an exploratory factor analysis on the other variables. The data are available at

% Had full data file in 429f04, and it's used in the data analysis text. I have always thought this was STA302 in 1990, but the number of quizzes and computer assignments is wrong.


\begin{center}
\href{https://www.utstat.toronto.edu/brunner/openSEM/data/statclass.data.txt}
     {\texttt{https://www.utstat.toronto.edu/brunner/openSEM/data/statclass.data.txt}}.
\end{center}
Use the \texttt{header=TRUE} option on \texttt{read.table()}.

    \begin{enumerate}
        \item To help decide on the number of factors, calculate the eigenvalues of the correlation matrix and prepare a scree plot. Print hard copy of the scree plot and bring it to the quiz. 
        \item The number of eigenvalues greater than one and the scree plot point to different answers. Going with the smaller number, carry out a maximum likelihood factor analysis with a varimax rotation. Does it fit? (Use the $\alpha=0.05$ significance level, of course.) Be able to give the chi-squared statistic, the degrees of freedom and the $p$-value.
        \item Amazingly, the model fits. Try a smaller number of factors, and keep trying until the model no longer fits. For all the models you estimate, be able to give the chi-squared statistic, the degrees of freedom and the $p$-value.
        \item I have a lot of trouble deciding between~3 and~4 factors. I finally decided to go with four, even though Factor~4 seems to be dominated by Computer Assignment~3, whatever that was. This is not really a question. I suppose the correct answer is ``Okay."
        \item For the 4-factor model, obtain communalities in two different ways. One way to get them is to extract the diagonal of a certain matrix product with the \texttt{diag()} function. The other way is more obvious. Use \texttt{cbind()} to display the two sets of estimates side by side.
        \item What estimated proportion of the variance of Computer Assignment 3 is explained by the common factors?
        \item What estimated proportion of the final exam's variance is explained by the common factors?
        \item What is the estimated correlation between score on the midterm and Factor~1?
        \item What is the estimated reliability of the final exam as a measure of Factor~1?
    \end{enumerate} % End of R question

\end{enumerate} % End of all the questions

\vspace{20mm}

% \pagebreak
\vspace{3mm}
\noindent
\textbf{Please bring a printout of your full R input and output for Question \ref{Rstatclass} to the quiz.}


\end{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Look at A10


\begin{comment}

I cut this one out because it was getting too complicated.

        \item My guess is that Factor~1 is the main thing that assessment in the course was trying to measure. There were eight quizzes and nine computer assignments in addition to the midterm and the final. The midterm and the final were out of 100 points, while the quizzes and computer assignments were out of 10. The mark was computed as
\begin{center}
     \texttt{mark = 0.3*quizave + 0.1*compave + 0.3*midterm + 0.3*final}
\end{center}
The question is, what is the estimated reliability of \texttt{mark} as a measure of Factor~1. It's a bit of work to get the answer, but it's important because it's an indication of how effective (fair?) the marking was. Assume that the individual marks (not the averages) are standardized. Use the definition of reliability from the formula sheet. Some paper and pencil work will be needed before you know what to compute. The answer is a single number between zero and one. 

Actual computation was with unstandardized data:

mark = 0.3*quizave*10 + 0.1*compave*10 + 0.3*midterm + 0.3*final

                \begin{enumerate}
                    \item 
                    \item 
                \end{enumerate}



\end{comment}