% 431s23Assignment6.tex     Measurement error, ignoring measurement error in regression
\documentclass[11pt]{article} 
%\usepackage{amsbsy} % for \boldsymbol and \pmb 
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage{comment}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

% Formula sheet needs reliability


\begin{document}

%\enlargethispage*{1000 pt} 


\begin{center}   
{\Large \textbf{STA 431s23 Assignment Six}}\footnote{This assignment was prepared by  \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner},
Department of Statistical Sciences, University of Toronto. It is licensed under a 
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
     {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/brunner/oldclass/431s23} {\small\texttt{http://www.utstat.toronto.edu/brunner/oldclass/431s23}}}
\vspace{1 mm}
\end{center}

\begin{comment}
\noindent
\emph{For the Quiz on Friday Feb.~17th, please bring a printout of your full R input and output for Question~\ref{smokeR}. The other problems are not to be handed in. They are practice for the Quiz.}
\vspace{2mm}
\hrule
\end{comment}


\begin{enumerate} 


%%%%%%%%%%%%%%%%%%%%%%%% Measurement Error %%%%%%%%%%%%%%%%%%%%%%%%

   \item\label{measurementbias} In a study of diet and health, suppose we want to know how much snack food each person eats, and we ``measure" it by asking a question on a questionnaire. Surely there will be measurement error, and suppose it is of a simple additive nature. But we are pretty sure people under-report how much snack food they eat, so a model like~$W = X + e$ with $E(e)=0$ is hard to defend. Instead, let
\begin{displaymath}
    W = \nu + X + e,
\end{displaymath}
where $E(X)=\mu_x$, $E(e)= 0$, $Var(X)=\sigma^2_x$, $Var(e)=\sigma^2_e$, and $Cov(X,e)=0$
The unknown constant $\nu$ could be called \emph{measurement bias}. Calculate the reliability of $W$ for this model. Is it the same as the expression for reliability given in the text and lecture, or does $\nu\neq 0$ make a difference? % Lesson: Assuming expected values and intercepts zero does no harm.

    \item Continuing Question~\ref{measurementbias}, suppose that two measurements of $W$ are available. 
\begin{eqnarray}
    W_1 & = & \nu_1 + X + e_1  \nonumber \\ 
    W_2 & = & \nu_2 + X + e_2, \nonumber 
\end{eqnarray}
where $E(X)=\mu_x$, $Var(X)=\sigma^2_x$,  $E(e_1)=E(e_2)=0$, $Var(e_1)=Var(e_2)=\sigma^2_e$, and  $X$, $e_1$ and $e_2$ are all independent. Calculate $Corr(W_1,W_2)$. Does this correlation still equal the reliability even when $\nu_1$ and $\nu_2$ are non-zero and potentially different from one another? % Yes. Intercepts don't matter. 

    \item\label{goldstandard} Let $X$ be a latent variable, $W = X + e_1$ be the usual measurement of $X$ with error, and $G = X+e_2$ be a measurement of $X$ that is deemed ``gold standard," but of course it's not completely free of measurement error. It's better than $W$ in the sense that $0<Var(e_2)<Var(e_1)$, but that's all you can really say. This is a realistic scenario, because nothing is perfect. Accordingly, let 
\begin{eqnarray}
    W & = & X + e_1 \nonumber \\ \nonumber
    G & = & X + e_2,          \\ \nonumber
\end{eqnarray}
where $E(X)=\mu_x$, $Var(X)=\sigma^2_x$,  $E(e_1)=E(e_2)=0$, $Var(e_1)=\sigma^2_1$, $Var(e_2)=\sigma^2_2$ and $X$, $e_1$ and $e_2$ are all independent of one another. 
    \begin{enumerate}
        \item Make a path diagram of this model.
        \item Prove that the squared correlation between $W$ and $G$ is strictly less than the reliability of $W$. Show your work.

The idea here is that the squared \emph{population} 
correlation\footnote{When we do Greek-letter calculations, we are figuring out what is happening in the population from which a data set might be a random sample.}
between an ordinary measurement and an imperfect gold standard measurement is strictly less than the actual reliability of the ordinary measurement. If we were to estimate such a squared correlation by the corresponding squared \emph{sample} correlation, we would be estimating a quantity that is not the reliability. On the other hand, we would be estimating a lower bound for the reliability, and this could be reassuring if it were a high number.
    \end{enumerate}

    \item\label{testlength} Suppose we have two equivalent measurements with uncorrelated measurement error:
\begin{eqnarray*}
    W_1 & = & X + e_1  \\ 
    W_2 & = & X + e_2,          
\end{eqnarray*}
where $E(X)=\mu_x$, $Var(X)=\sigma^2_x$,  $E(e_1)=E(e_2)=0$, $Var(e_1)=Var(e_2)=\sigma^2_e$, and  $X$, $e_1$ and $e_2$ are all independent. The ``equivalent" part means the two measurements have the same amount of noise. What if we were to measure the true score $X$ by adding the two imperfect measurements together? Would the result be more reliable?
        \begin{enumerate}
            \item Let $S=W_1+W_2$. Show that the reliability of $S$ is $\frac{\sigma^2_x}{\sigma^2_x+\frac{1}{2}\sigma^2_e}$. Is this greater than $\frac{\sigma^2_x}{\sigma^2_x+\sigma^2_e}$?
            \item Suppose you take $n$ independent measurements (in psychometric theory, these would be called equivalent test items). Show that the reliability of $S_n=\sum_{i=1}^n W_i$ is $\frac{\sigma^2_x}{\sigma^2_x+\frac{1}{n}\sigma^2_e}$.
            \item What is the reliability of $\overline{W}_n=\frac{1}{n}\sum_{i=1}^n W_i$? Show your work.
            \item What happens to the reliability of $S_n$ and $\overline{W}_n$ as the number of measurements $n \rightarrow \infty$? 
        \end{enumerate}

Equivalent test items may be largely a fantasy, but this question shows how equivalent \emph{tests} is a goal that can be closely approximated in practice. In the two equations displayed above, $W_1$ and $W_2$ might not be test items, but tests composed of multiple items. Each item might have a different error variance. But if the two \emph{sums} or \emph{averages} of the error variances are the same, the two tests are equivalent. This is nice, because it tells you that two tests do not need to be matched item for item in order to be equivalent.

\item Consider the two equivalent measurements at the start of Question~\ref{testlength}. It is easy to imagine omitted variables that would affect both observed scores. For example, if $W_1$ and $W_2$ are two questionnaires about eating habits, some people will probably mis-remember or lie the same way on both questionnaires. Since $e_1$ and $e_2$ represent all other influences apart from the true quantity being measured, this means that $e_1$ and $e_2$ will have non-zero covariance. Furthermore, this covariance will be positive, since the omitted variables (there could be dozens of them) will tend to affect the two measurements in the same way. Accordingly, in the initial model of Question~\ref{testlength}, let $Cov(e_1,e_2)=c>0$.
    \begin{enumerate}
        \item Draw a path diagram of the model.
        \item Show that $Corr(W_1,W_2)$ is strictly \emph{greater} than the reliability. 

This means that in practice, omitted variables will result in over-estimates of reliability. There are almost always omitted variables. 
    \end{enumerate}

\newpage
%%%%%%%%%%%%%%%%%%%%%%%% Regression with measurement error %%%%%%%%%%%%%%%%%%%%%%%%

\item This question explores the consequences of ignoring measurement error in the response variable. Independently for $i=1, \ldots,n$, let
\begin{eqnarray}
    Y_i &=& \beta_0 + \beta_1 X_i + \epsilon_i 
\nonumber \\
    V_i &=&  Y_i + e_i,  \nonumber
\end{eqnarray}
where $Var(X_i)=\phi$, $E(X_i) = \mu_x$, $Var(e_i)=\omega$,  $Var(\epsilon_i)=\psi$,
and $X_i, e_i, \epsilon_i$ are all independent. The explanatory variable $X_i$ is observable, but the response variable $Y_i$ is latent. Instead of $Y_i$, we can see $V_i$, which is $Y_i$ plus a piece of random noise.  Call this the \emph{true model}.
    \begin{enumerate}
        \item Make a path diagram of the true model.
        \item Strictly speaking, the distributions of $X_i, e_i$ and $\epsilon_i$ are unknown parameters because they are unspecified. But suppose we are interested in identifying just the Greek-letter parameters. Does the true model pass the test of the Parameter Count Rule? Answer Yes or No and give the numbers.
        \item Calculate the variance-covariance matrix of the observable variables as a function of the model parameters. Show your work.
        \item Suppose that the analyst assumes that $V_i$ is that same thing as $Y_i$, and fits the naive model $ V_i = \beta_0 + \beta_1 X_i + \epsilon_i$, in which
\begin{displaymath}
    \widehat{\beta}_1 = \frac{\sum_{i=1}^n(X_i-\overline{X})(V_i-\overline{V})}
                           {\sum_{i=1}^n(X_i-\overline{X})^2}.
\end{displaymath}
Assuming the \emph{true} model (not the naive model), is $\widehat{\beta}_1$ a consistent estimator of $\beta_1$? Answer Yes or No and show your work. 
        \item Why does this prove that $\beta_1$ is identifiable?
    \end{enumerate}

% A 2015 HW question, improved in 2017
\item \label{randiv} This question explores the consequences of ignoring measurement error in the explanatory variable when there is only one explanatory variable. Independently for $i = 1 , \ldots, n$, let 
\begin{eqnarray*}
    Y_i & = &  \beta X_i + \epsilon_i \\
    W_i & = & X_i + e_i
\end{eqnarray*}
where all random variables are normal with expected value zero, $Var(X_i)=\phi>0$,  $Var(\epsilon_i)=\psi>0$, $Var(e_i)=\omega>0$ and $\epsilon_i$, $e_i$ and $X_i$ are all independent.  The variables $W_i$ and $Y_i$ are observable, while $X_i$ is latent. Error terms are never observable. 
    \begin{enumerate}
        \item What is the parameter vector $\boldsymbol{\theta}$ for this model?
        \item Denote the covariance matrix of the observable variables by  $\boldsymbol{\Sigma} = [\sigma_{ij}]$. The unique $\sigma_{ij}$ values are the moments, and there is a covariance structure equation for each one. Calculate the variance-covariance matrix $\boldsymbol{\Sigma}$ of the observable variables, expressed as a function of the model parameters. You now have the covariance structure equations.
        \item Does this model pass the test of the parameter count rule? Answer Yes or No and give the numbers.
        \item Are there any points in the parameter space where the parameter $\beta$ is identifiable? Are there infinitely many, or just one point?
        \item The naive estimator of $\beta$ is 
            \begin{displaymath}
                \widehat{\beta}_n = \frac{\sum_{i=1}^n W_i Y_i}{\sum_{i=1}^n W_i^2}.
            \end{displaymath}        
Is $\widehat{\beta}_n$ a consistent estimator of $\beta$? To what does $\widehat{\beta}_n$ converge?
        \item Are there any points in the parameter space for which $\widehat{\beta}_n$ converges to the right answer? Compare your answer to the set of points where $\beta$ is identifiable.
        \item Suppose the reliability of $W_i$ were known, or to be more realistic, suppose that a good estimate of the reliability were available; call it $r^2_{wx}$. How could you use $r^2_{wx}$ to improve $\widehat{\beta}_n$? Give the formula for an improved estimator of $\beta$.  
%        \item Because of correlated measurement error, one suspects that many published estimates of reliability are too high. Suppose $r^2_{wx}$ is an overestimate of the true reliability $\rho^2_{wx}$. What effect does this have on your improved estimate of $\beta$? 
    \end{enumerate}

% The core of this was on the 2015 final, but it's enhanced in 2017.
\item The improved version of $\widehat{\beta}_n$ in the last question is an example of \emph{correction for attenuation} (weakening) caused by measurement error. Here is the version that applies to correlation. Independently for $i=1, \ldots, n$, let

% Need eqnarray inside a parbox to make it the cell of a table
\begin{tabular}{ccc}
\parbox[m]{1.5in} {
\begin{eqnarray*}
    D_{i,1} &=& F_{i,1} + e_{i,1}   \\
    D_{i,2} &=& F_{i,2} + e_{i,2}   \\
&&
\end{eqnarray*}    
} % End parbox
&
$cov\left( \begin{array}{c} F_{i,1} \\ F_{i,2} \end{array} \right)
= \left( \begin{array}{c c}
        \phi_{11} & \phi_{12}  \\
        \phi_{12} & \phi_{22}
        \end{array} \right)$ &
$cov\left( \begin{array}{c} e_{i,1} \\ e_{i,2} \end{array} \right)
= \left( \begin{array}{c c}
        \omega_1  & 0  \\
            0     & \omega_2
        \end{array} \right)$
\end{tabular}

\noindent
To make this concrete, it would be natural for psychologists to be interested in the correlation between intelligence and self-esteem, but what they want to know is the correlation between \emph{true} intelligence and \emph{true} self-esteem, not just the between score on an IQ test and score on a self-esteem questionnaire. So for subject $i$, let $F_{i,1}$ represent true intelligence and $F_{i,2}$ represent true self-esteem, while $D_{i,1}$ is the subject's score on an intelligence test and $D_{i,1}$ is score on a self-esteem questionnaire.
    \begin{enumerate}
        \item Make a path diagram of this model.
        \item Show that $|Corr(D_{i,1},D_{i,2})| \leq |Corr(F_{i,1},F_{i,2})|$. That is, measurement error weakens (attenuates) the correlation.
        \item Suppose the reliability of $D_{i,1}$ is $\rho^2_1$ and the reliability of $D_{i,2}$ is $\rho^2_2$. How could you apply $\rho^2_1$ and $\rho^2_2$ to $Corr(D_{i,1},D_{i,2})$, to obtain $Corr(F_{i,1},F_{i,2})$?
        \item You obtain a sample correlation between IQ score and self-esteem score of $r = 0.25$, which is disappointingly low. From other data, the estimated reliability of the IQ test is $r^2_1 = 0.90$, and the estimated reliability of the self-esteem scale is $r^2_2 = 0.75$. Give an estimate of the correlation between true intelligence and true self-esteem. My answer is 0.304. 
% 0.25 / sqrt(0.9*0.75) = 0.3042903
    \end{enumerate}

\newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% 2015 Final, but fixed up in 2023.
\item This is a simplified version of the situation where one is attempting to ``control" for explanatory variables that are measured with error. People do this all the time, and it doesn't work. Independently for $i=1, \ldots, n$,  let
\begin{eqnarray*}
   Y_i &=& \beta_1 X_{i,1} + \beta_2 X_{i,2} + \epsilon_i   \\
   W_i &=& X_{i,1} + e_i,   
\end{eqnarray*}    
where $cov\left( \begin{array}{c} X_{i,1} \\ X_{i,2} \end{array} \right)
= \left( \begin{array}{c c}
        \phi_{11} & \phi_{12}  \\
        \phi_{12} & \phi_{22}
        \end{array} \right)$, 
$Var(\epsilon_i) = \psi$, $Var(e_1) = \omega$, all the expected values are zero, and the error terms $\epsilon_i$ and $ e_i$ are independent of one another, and also independent of $X_{i,1}$ and $X_{i,2}$. The variable $X_{i,1}$ is latent, while the variables $W_i$, $Y_i$ and $X_{i,2}$ are observable. What people usually do in situations like this is fit a model like 
$Y_i = \beta_1 W_i + \beta_2 X_{i,2} + \epsilon_i$, and test $H_0: \beta_2 = 0$. That is, they ignore the measurement error in variables for which they are ``controlling." The usual fixed-$x$ estimator is
\begin{displaymath}
    \widehat{\beta}_2 = 
\frac{\sum_{i=1}^nW_i^2 \sum_{i=1}^nX_{i,2}Y_i - \sum_{i=1}^nW_iX_{i,2}\sum_{i=1}^nW_iY_i}
     {\sum_{i=1}^n W_i^2 \sum_{i=1}^n X_{i,2}^2 - (\sum_{i=1}^nW_iX_{i,2})^2 }
\end{displaymath}

    \begin{enumerate}
        \item $\widehat{\beta}_2$ converges in probability to a definite target. Give the target in terms of the model parameters. Remember that if $E(X)=0$, then $E(X^2)=Var(X)$.  This means you can use rules about variances to make some of the calculations easier.
        \item The target is a fairly complicated expression, but if it's correct, it should reduce to $\beta_2$ when $\omega=0$ (no measurement error). Verify this.
        \item Now let $\omega>0$ as before, and suppose that $H_0: \beta_2 = 0$ is true. Does the $\widehat{\beta}_2$ converge to the true value of $\beta_2 = 0$ as $n \rightarrow \infty$ everywhere in the parameter space? Answer Yes or No.
        \item Under what conditions (that is, for what values of other parameters) does 
$ \widehat{\beta}_2 \stackrel{p}{\rightarrow} 0$ when  $\beta_2 = 0$?
    \end{enumerate} 

\newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% 2015 HW 
\item Finally we have a solution, though as usual there is a little twist. Independently for $i=1, \ldots, n$, let
\begin{eqnarray*}
    Y_{i~~} &=& \beta X_i + \epsilon_i   \\
    V_{i~~} &=& Y_i + e_i      \\
    W_{i,1} &=& X_i + e_{i,1}  \\ 
    W_{i,2} &=& X_i + e_{i,2}
\end{eqnarray*}
where 
\begin{itemize}
     \item $Y_i$ is a latent variable.
     \item $V_i$, $W_{i,1}$ and $W_{i,2}$ are all observable variables.
     \item $X_i$ is a normally distributed \emph{latent} variable with mean zero and variance $\phi>0$.
     \item $\epsilon_i$ is normally distributed with mean zero and variance $\psi>0$.
     \item $e_{i}$ is normally distributed with mean zero and variance $\omega>0$.
     \item $e_{i,1}$ is normally distributed with mean zero and variance $\omega_1>0$.
     \item $e_{i,2}$ is normally distributed with mean zero and variance $\omega_2>0$.
     \item $X_i$, $\epsilon_i$, $e_i$, $e_{i,1}$ and $e_{i,2}$ are all independent of one another.
\end{itemize}
        \begin{enumerate}
            \item Make a path diagram of this model.
            \item What is the parameter vector $\boldsymbol{\theta}$ for this model? 
            \item Does the model pass the test of the Parameter Count Rule? Answer Yes or No and give the numbers.
            \item Calculate the variance-covariance matrix of the observable variables as a function of the model parameters. Some of the variances and covariances you can just write down. For the others, show your work.
            \item Is the parameter vector identifiable at every point in the parameter space? Answer Yes or No and prove your answer.  
            \item Some parameters are identifible, while others are not. Which ones are identifiable?
            \item If $\beta$ (the paramter of main interest) is identifiable, propose a Method of Moments estimator for it and prove that your proposed estimator is consistent. 
            \item Suppose the sample variance-covariance matrix $\widehat{\boldsymbol{\Sigma}}$ is
\begin{verbatim}
                  W1    W2     V
            W1 38.53 21.39 19.85
            W2 21.39 35.50 19.00
            V  19.85 19.00 28.81
\end{verbatim}
Give a reasonable estimate of $\beta$. There is more than one right answer. The answer is a number. (Is this the Method of Moments estimate you proposed? It does not have to be.) \textbf{Circle your answer.}

            \item Describe how you could re-parameterize this model to make the parameters all identifiable, allowing you do maximum likelihood.
        \end{enumerate}




\end{enumerate} % End of all the questions


\end{document} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\begin{comment}


                \begin{enumerate}
                    \item 
                    \item 
                \end{enumerate}


\vspace{3mm}
\noindent
\textbf{Please bring a printout of your full R input and output for Question \ref{smokeR} to the quiz.}

\end{comment}




% Call this random explanatory -- or maybe wait until identifiability. 

\item Independently for $i=1, \ldots, n$, let $\mathbf{y}_i = \boldsymbol{\beta}_0 + \boldsymbol{\beta}_1 \mathbf{x}_i +   \boldsymbol{\epsilon}_i$, where
\begin{itemize}
    \item $\mathbf{y}_i$ is an $q \times 1$ random vector of observable response variables; there are $q$ response variables. 
    \item $\mathbf{x}_i$ is a $p \times 1$ observable random vector; there are $p$ explanatory variables. $E(\mathbf{x}_i) = \boldsymbol{\mu}_x$ and $cov(\mathbf{x}_i) = \boldsymbol{\Phi}_{p \times p}$. The positive definite matrix $\boldsymbol{\Phi}$ is unknown. 
    \item $\boldsymbol{\beta}_0$ is a  $q \times 1$ matrix of unknown constants.
    \item $\boldsymbol{\beta}_1$ is a  $q \times p$ matrix of unknown constants.  
    \item $\boldsymbol{\epsilon}_i$ is a $q \times 1$ random vector with expected value zero and unknown positive definite variance-covariance matrix $cov(\boldsymbol{\epsilon}_i) = \boldsymbol{\Psi}_{q \times q}$.  
    \item $\boldsymbol{\epsilon}_i$ is independent of $\mathbf{x}_i$.
\end{itemize}
Letting $\mathbf{d}_i = \left(\begin{array}{c} \mathbf{x}_i  \\ \hline  \mathbf{y}_i \end{array} \right)$, we have 
$cov(\mathbf{d}_i) = \boldsymbol{\Sigma} = \left( \begin{array}{c|c}
                \boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy} \\ \hline
                \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_y    
                                                 \end{array} \right)$, and 
$\widehat{\boldsymbol{\Sigma}} = \left( \begin{array}{c|c}
        \widehat{\boldsymbol{\Sigma}}_x & \widehat{\boldsymbol{\Sigma}}_{xy} \\ \hline
        \widehat{\boldsymbol{\Sigma}}_{yx} & \widehat{\boldsymbol{\Sigma}}_y    
                                                 \end{array} \right)$.
    \begin{enumerate}
        \item Give the dimensions (number of rows and columns) of the following matrices: \\
$\mathbf{d}_i$, $\boldsymbol{\Sigma}$, $\boldsymbol{\Sigma}_{x}$, $\boldsymbol{\Sigma}_{y}$, $\boldsymbol{\Sigma}_{xy}$, $\boldsymbol{\Sigma}_{yx}$.
        \item Write the parts of $\boldsymbol{\Sigma}$ in terms of the unknown parameter matrices.
        \item Give a Method of Moments Estimator for $\boldsymbol{\Phi}$. Just write it down. 
        \item Obtain formulas for the Method of Moments Estimators of $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_0$ and $\boldsymbol{\Psi}$. Show your work. You may give $\widehat{\boldsymbol{\beta}}_0$   in terms of $\widehat{\boldsymbol{\beta}}_1$, but simplify $\widehat{\boldsymbol{\Psi}}$.
        \item If the distributions of $\mathbf{x}_i$ and $\boldsymbol{\epsilon}_i$ are multivariate normal, how do you know that your Method of Moments estimates are also the MLEs?
    \end{enumerate} % End of multivariate regression question