% 431s23Assignment5.tex Omitted variables, instrumental variables
\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage{comment}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 431s23 Assignment Five}}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/brunner/oldclass/431s23}{\small\texttt{http://www.utstat.toronto.edu/brunner/oldclass/431s23}}}
\vspace{1 mm}
\end{center}

\noindent
\emph{For the Quiz on Friday Feb.~17th, please bring a printout of your full R input and output for Question~\ref{smokeR}. The other problems are not to be handed in. They are practice for the Quiz.}

\vspace{2mm} \hrule

\begin{enumerate}

\item \label{omittedvars} In the following regression model, the explanatory variables $X_1$ and $X_2$ are random variables. The true model is
\begin{displaymath}
Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \epsilon_i,
\end{displaymath}
independently for $i= 1, \ldots, n$, where $\epsilon_i \sim N(0,\sigma^2)$. The mean and covariance matrix of the explanatory variables are given by
\begin{displaymath}
E\left( \begin{array}{c} X_{i,1} \\ X_{i,2} \end{array} \right) =
\left( \begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right)
\mbox{~~ and ~~}
Var\left( \begin{array}{c} X_{i,1} \\ X_{i,2} \end{array} \right) =
\left( \begin{array}{rr} \phi_{11} & \phi_{12} \\
                         \phi_{12} & \phi_{22} \end{array} \right).
\end{displaymath}
Unfortunately $X_{i,2}$, which has an impact on $Y_i$ and is correlated with $X_{i,1}$, is not part of the data set. Since $X_{i,2}$ is not observed, it is absorbed by the intercept and error term, as follows.
\begin{eqnarray*}
Y_i &=& \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \epsilon_i \\
    &=& (\beta_0 + \beta_2\mu_2) + \beta_1 X_{i,1} + (\beta_2 X_{i,2} - \beta_2 \mu_2 + \epsilon_i) \\
    &=& \beta^\prime_0 + \beta_1 X_{i,1} + \epsilon^\prime_i.
\end{eqnarray*}
The last line represents the ``true'' model. The primes just denote a new $\beta_0$ and a new $\epsilon_i$. It was necessary to add and subtract $\beta_2 \mu_2$ in order to obtain $E(\epsilon^\prime_i)=0$. And of course there could be more than one omitted variable. They would all get swallowed by the intercept and error term, the garbage bins of regression analysis.
\begin{enumerate}
\item Make a path diagram of the model with $X_{i,1}$ and $X_{i,2}$.
\item What is $Cov(X_{i,1},\epsilon^\prime_i)$?
\item Make a path diagram of the model with the primes.
\item Calculate the variance-covariance matrix of $(X_{i,1},Y_i)$ under the true model.
\item Suppose we want to estimate $\beta_1$. The usual least squares estimator is
\begin{displaymath}
\widehat{\beta}_1 = \frac{\sum_{i=1}^n(X_{i,1}-\overline{X}_1)(Y_i-\overline{Y})}
                         {\sum_{i=1}^n(X_{i,1}-\overline{X}_1)^2}.
\end{displaymath}
You may just use this formula; you don't have to derive it. Is $\widehat{\beta}_1$ a consistent estimator of $\beta_1$ (meaning consistent at all points in the parameter space) if the true model holds? Answer Yes or No and show your work. Remember, $X_2$ is not available, so you are doing a regression with one explanatory variable. You may use the consistency of the sample variance and covariance without proof.
\item Are there \emph{any} points in the parameter space for which $\widehat{\beta}_1 \stackrel{p}{\rightarrow} \beta_1$ when the true model holds? A simulation sketch related to parts (e) and (f) follows this list.
\end{enumerate}
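If you want a numerical preview of parts (e) and (f), the little R simulation below is one way to get it. It is strictly optional and not part of the assignment; the sample size and all parameter values are made up for illustration, and you should still work out the limit algebraically.
\begin{verbatim}
# Simulate the true model with an omitted variable, and see where the
# one-variable least squares estimate of beta1 goes. Illustrative values
# only: beta0 = beta1 = beta2 = 1, phi11 = phi22 = 1, phi12 = 0.5.
set.seed(431)
n = 100000                        # Large n, so betahat1 is near its target
X = MASS::mvrnorm(n, mu = c(0, 0),
                  Sigma = rbind(c(1.0, 0.5),
                                c(0.5, 1.0)))
epsilon = rnorm(n)                # epsilon ~ N(0,1)
Y = 1 + 1*X[, 1] + 1*X[, 2] + epsilon
coef(lm(Y ~ X[, 1]))[2]           # Compare this to beta1 = 1
\end{verbatim}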
\item A useful way to write a fixed-$x$ regression model is $y_i = \boldsymbol{\beta}^\top\mathbf{x}_i + \epsilon_i$, where $\mathbf{x}_i$ is a $p \times 1$ vector of constants. Of course usually the explanatory variables are best modeled as random variables. So, the model really should be $y_i = \boldsymbol{\beta}^\top \boldsymbol{\mathcal{X}}_i + \epsilon_i$, and the usual model is conditional on $\boldsymbol{\mathcal{X}}_i = \mathbf{x}_i$. In what way does the usual conditional linear regression model imply that (random) explanatory variables have zero covariance with the error term? For notational convenience, assume that $\boldsymbol{\mathcal{X}}_i$ and $\epsilon_i$ are continuous. What is the conditional distribution of $\epsilon_i$ given $\boldsymbol{\mathcal{X}}_i = \mathbf{x}_i$?

\item In a regression with one explanatory variable, show that $E(\epsilon_i|X_i=x_i)=0$ for all $x_i$ implies $Cov(X_i,\epsilon_i)=0$, so that \emph{a standard regression model without the normality assumption still implies zero covariance} (though not necessarily independence) \emph{between the error term and the explanatory variables}. Hint: If you get stuck, the matrix version of this calculation is in the text. We are in Chapter Zero.

\item Independently for $i=1, \ldots, n$, let $y_i = \beta x_i + \epsilon_i$, where $x_i \sim N(\mu_x,\sigma^2_x)$ and $\epsilon_i \sim N(0,\sigma^2_\epsilon)$. Because of omitted variables, $x_i$ and $\epsilon_i$ are not independent: $Cov(x_i,\epsilon_i)=c$.
\begin{enumerate}
\item The usual fixed-$x$ estimator of $\beta$ is $\widehat{\beta}_n = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$. Is $\widehat{\beta}_n$ a consistent estimator of $\beta$? Answer Yes or No and prove it.
\item Another estimator you have seen before is $\tilde{\beta}_n = \frac{\overline{y}_n}{\overline{x}_n}$. Suppose $\mu_x \neq 0$. Do we have $\tilde{\beta}_n \stackrel{p}{\rightarrow} \beta$? Answer Yes or No and show your work. A simulation sketch comparing the two estimators follows this question.
\end{enumerate}
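You can likewise check your conclusions about $\widehat{\beta}_n$ and $\tilde{\beta}_n$ numerically. Again this is optional, and the numbers ($\beta = 1$, $\mu_x = 5$, $\sigma^2_x = \sigma^2_\epsilon = 1$, $c = 0.6$) are made up.
\begin{verbatim}
# Generate (x, epsilon) pairs with covariance c, then compare the two
# estimators to beta = 1. Illustrative values only.
set.seed(431)
n = 100000
xe = MASS::mvrnorm(n, mu = c(5, 0),            # E(x) = 5, E(epsilon) = 0
                   Sigma = rbind(c(1.0, 0.6),  # c = Cov(x, epsilon) = 0.6
                                 c(0.6, 1.0)))
x = xe[, 1]; epsilon = xe[, 2]
y = 1*x + epsilon
sum(x*y) / sum(x^2)     # The usual fixed-x estimator
mean(y) / mean(x)       # The ratio estimator
\end{verbatim}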
\item \label{generalIV} The following is the general instrumental variables regression model for observed variables. Independently for $i=1, \ldots, n$,
\begin{displaymath}
\mathbf{y}_i = \boldsymbol{\beta}_0 + \boldsymbol{\beta}_1 \mathbf{x}_i + \boldsymbol{\epsilon}_i, \mbox{ where}
\end{displaymath}
% {\footnotesize
\begin{itemize}
\item $\mathbf{y}_i$ is a $q \times 1$ random vector of observable response variables, so the regression is multivariate; there are $q$ response variables.
\item $\mathbf{x}_i$ is a $p \times 1$ observable random vector; there are $p$ explanatory variables.
\item $E(\mathbf{x}_i) = \boldsymbol{\mu}_x$ and $cov(\mathbf{x}_i) = \boldsymbol{\Phi}_x$.
\item $\boldsymbol{\beta}_0$ is a $q \times 1$ vector of unknown constants.
\item $\boldsymbol{\beta}_1$ is a $q \times p$ matrix of unknown constants. These are the regression coefficients, with one row for each response variable and one column for each explanatory variable.
\item $\boldsymbol{\epsilon}_i$ is a $q \times 1$ unobservable random vector with expected value zero and unknown variance-covariance matrix $cov(\boldsymbol{\epsilon}_i) = \boldsymbol{\Psi}$.
\item $cov(\mathbf{x}_i,\boldsymbol{\epsilon}_i) = \pmb{\mbox{\c{C}}}$, a $p \times q$ matrix of covariances that arise from omitted variables.
\item There are at least $p$ instrumental variables. Put the best $p$ in the random vector $\mathbf{z}_i$.
\item $E(\mathbf{z}_i)=\boldsymbol{\mu}_z$ and $cov(\mathbf{z}_i)=\boldsymbol{\Phi}_z$.
\item $cov(\mathbf{x}_i,\mathbf{z}_i)= \mbox{\LARGE$\boldsymbol{\kappa}$}$, a $p \times p$ matrix of covariances. Assume $\mbox{\LARGE$\boldsymbol{\kappa}$}$ has an inverse.
\item $cov(\mathbf{z}_i,\boldsymbol{\epsilon}_i) = \mathbf{O}$, a $p \times q$ matrix of zeros. The instrumental variables are unrelated to the omitted variables; this is what makes them useful.
\end{itemize}
% } % End size
\newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\enlargethispage*{1000 pt}
\begin{enumerate}
\item Calculate the expected value and variance-covariance matrix of $\left( \begin{array}{c} \mathbf{x}_i \\ \hline \mathbf{y}_i \\ \hline \mathbf{z}_i \end{array} \right)$ in terms of the model parameters. Your answers are partitioned matrices.
\item Writing $\boldsymbol{\Sigma} = \left( \begin{array}{c|c|c}
\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} & \boldsymbol{\Sigma}_{13} \\ \hline
\boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} & \boldsymbol{\Sigma}_{23} \\ \hline
\boldsymbol{\Sigma}_{31} & \boldsymbol{\Sigma}_{32} & \boldsymbol{\Sigma}_{33}
\end{array} \right)$,
\begin{enumerate}
\item \label{beta1tilde} Solve for the crucial parameter matrix $\boldsymbol{\beta}_1$ in terms of $\boldsymbol{\Sigma}_{ij}$ matrices.
\item Give a method of moments estimator for $\boldsymbol{\beta}_1$. A numerical illustration in R follows this question.
\item Indicate how it is possible to solve for the other parameter matrices that appear in $\boldsymbol{\Sigma}$. You don't have to give the complete solutions in terms of $\boldsymbol{\Sigma}_{ij}$ matrices. For example, you have $\boldsymbol{\Phi}_x = \boldsymbol{\Sigma}_{11}$, and you also have a solution for $\boldsymbol{\beta}_1$. So, you can just write $\pmb{\mbox{\c{C}}} = \boldsymbol{\Sigma}_{12} - \boldsymbol{\Phi}_x\boldsymbol{\beta}_1^\top$.
\end{enumerate}
\end{enumerate}
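Once you have your solution to part (b), you might enjoy seeing the matrix mechanics in R. The sketch below assumes a solution of the form $\boldsymbol{\beta}_1 = \boldsymbol{\Sigma}_{23}\boldsymbol{\Sigma}_{13}^{-1}$ (check whether that matches what you derived), with hypothetical dimensions $p = q = 2$. The placeholder data are pure noise, so the resulting estimate means nothing; only the mechanics matter.
\begin{verbatim}
# Columns of the placeholder data matrix D: x1, x2, y1, y2, z1, z2.
set.seed(431)
D = matrix(rnorm(500*6), 500, 6)
Sigmahat = var(D)               # Sample variance-covariance matrix
S13 = Sigmahat[1:2, 5:6]        # Estimates Sigma_13 = kappa
S23 = Sigmahat[3:4, 5:6]        # Estimates Sigma_23
S23 %*% solve(S13)              # Method of moments estimate of beta_1
\end{verbatim}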
\item \label{smoke} Instrumental variables are a powerful solution to the problem of omitted variables, but they are not easy to find. One suggestion is that cigarette taxes could be an instrumental variable for testing the connection between smoking and lung cancer. This relationship is hardly open to question, but still, estimates of it are contaminated by many omitted variables -- except in experimental studies where animals are exposed to cigarette smoke in a controlled way. So consider a study in which the $n$ cases are U.S.~states (provinces), and the variables are
\begin{itemize}
\item[$z$:] State tax on a pack of cigarettes (there's a federal tax too, but it's the same in all states).
\item[$x$:] Smoking rate in percent.
\item[$y_1$:] Age-adjusted rate of new Lung and Bronchus cancers (per 100k population).
\item[$y_2$:] Age-adjusted rate of new Brain and other nervous system cancers (per 100k population).
\end{itemize}
The ``age-adjusted'' business is some kind of regression correction. I believe we are getting residuals plus a constant. Here is a picture of an instrumental variables model for these data.
\begin{center}
\includegraphics[width=4in]{SmokingPath}
\end{center}
\noindent This model does not stand up to close examination. It has lots of flaws, and listing them (with discussion) would be enlightening. However, for now let's just pretend to believe it, and proceed with the homework problem.
\begin{enumerate}
\item \label{smokemod} Write down the model equations and the other details of the model, following the notation indicated in the path diagram. Use $Var(x) = \phi_x$ and $Var(z) = \phi_z$. The regression equations have intercepts.
\item Referring to the general model in Question~\ref{generalIV}, give the following matrices in terms of the model you have just written down: $\mathbf{y}_i$, $\boldsymbol{\epsilon}_i$, $\boldsymbol{\beta}_0$, $\boldsymbol{\beta}_1$, $\boldsymbol{\Psi}$, $\pmb{\mbox{\c{C}}}$, $\mbox{\LARGE$\boldsymbol{\kappa}$}$. Half marks off if you give the transpose.
\item Calculate $cov\left( \begin{array}{c} x_i \\ y_{i,1} \\ y_{i,2} \\ z_i \end{array} \right)$.
\item \label{smokemom} Give method of moments estimates of $\alpha_1$ and $\beta_1$. Compare them to your answer to \ref{beta1tilde}.
\item There are unique solutions for the other parameters as well. How do you know this without doing the calculations?
\item For the model you described in Question \ref{smokemod}, what is the parameter vector $\boldsymbol{\theta}$? It should consist only of the unique parameters. How many parameters are there?
\item How many moments (unique expected values, variances and covariances) are there?
\item How do you know that for this model, the method of moments estimators are also the maximum likelihood estimators?
\end{enumerate}

\newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\item \label{smokeR} The data set described in Problem \ref{smoke} is available at
\begin{center}
\href{https://www.utstat.toronto.edu/brunner/openSEM/data/CancerTax2.data.txt}{\texttt{https://www.utstat.toronto.edu/brunner/openSEM/data/CancerTax2.data.txt}}
\end{center}
\begin{enumerate}
\item Fit your model from Problem~\ref{smoke} (meaning estimate the parameters) with \texttt{lavaan}. My standard error for $\widehat{\psi}_{12}$ was 1.048. A sketch of possible \texttt{lavaan} syntax follows this question.
\item Using the \texttt{var} function with \texttt{na.rm=TRUE}, calculate your method of moments estimate of $\alpha_1$ from Problem~\ref{smokemom}.
\begin{enumerate}
\item Does it agree with the MLE?
\item Are you surprised?
\item Why does it not matter that \texttt{var} uses $n-1$ in the denominator, while the maximum likelihood estimates use $n$?
\end{enumerate}
\item The output of \texttt{summary} includes a test of $H_0: \alpha_1=0$.
\begin{enumerate}
\item Give the value of the test statistic. It is a number from your printout.
\item Give the $p$-value. It is a number from your printout.
\item In terms of the influence of smoking on cancer (which is the point of all this), what do you conclude from this test? If a conclusion is justified, draw a \emph{directional} conclusion.
\end{enumerate}
\item The output of \texttt{summary} includes a test of $H_0: \beta_1=0$.
\begin{enumerate}
\item Give the value of the test statistic. It is a number from your printout.
\item Give the $p$-value. It is a number from your printout.
\item In terms of the influence of smoking on cancer, what do you conclude from this test? If a conclusion is justified, draw a \emph{directional} conclusion.
\end{enumerate}
\end{enumerate}
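To help you get started on part (a), here is a sketch of what the \texttt{lavaan} code might look like. It is only a sketch: it assumes the data file has a header line with columns named \texttt{z}, \texttt{x}, \texttt{y1} and \texttt{y2} (look at the file to make sure), and the model string needs to match \emph{your} model from Problem~\ref{smoke}, including any parameters the sketch leaves out.
\begin{verbatim}
# Sketch only -- check the column names and complete the model string
# so that it matches your model for the smoking problem.
cancertax = read.table(
  "https://www.utstat.toronto.edu/brunner/openSEM/data/CancerTax2.data.txt",
  header = TRUE)
library(lavaan)
ivmodel = 'y1 ~ alpha1*x      # Smoking rate -> lung cancer
           y2 ~ beta1*x       # Smoking rate -> brain cancer
           y1 ~~ x            # Omitted variables: error covaries with x
           y2 ~~ x
           y1 ~~ psi12*y2     # Error covariance psi_12
           x  ~~ kappa*z      # Instrument covaries with smoking
          '
fit = sem(ivmodel, data = cancertax, fixed.x = FALSE)
summary(fit)
\end{verbatim}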
\end{enumerate} % End of all the questions

\vspace{3mm}

\noindent \textbf{Please bring a printout of your full R input and output for Question \ref{smokeR} to the quiz.}

\end{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Call this random explanatory -- or maybe wait until identifiability.
\item Independently for $i=1, \ldots, n$, let $\mathbf{y}_i = \boldsymbol{\beta}_0 + \boldsymbol{\beta}_1 \mathbf{x}_i + \boldsymbol{\epsilon}_i$, where
\begin{itemize}
\item $\mathbf{y}_i$ is a $q \times 1$ random vector of observable response variables; there are $q$ response variables.
\item $\mathbf{x}_i$ is a $p \times 1$ observable random vector; there are $p$ explanatory variables. $E(\mathbf{x}_i) = \boldsymbol{\mu}_x$ and $cov(\mathbf{x}_i) = \boldsymbol{\Phi}_{p \times p}$. The positive definite matrix $\boldsymbol{\Phi}$ is unknown.
\item $\boldsymbol{\beta}_0$ is a $q \times 1$ vector of unknown constants.
\item $\boldsymbol{\beta}_1$ is a $q \times p$ matrix of unknown constants.
\item $\boldsymbol{\epsilon}_i$ is a $q \times 1$ random vector with expected value zero and unknown positive definite variance-covariance matrix $cov(\boldsymbol{\epsilon}_i) = \boldsymbol{\Psi}_{q \times q}$.
\item $\boldsymbol{\epsilon}_i$ is independent of $\mathbf{x}_i$.
\end{itemize}
Letting $\mathbf{d}_i = \left(\begin{array}{c} \mathbf{x}_i \\ \hline \mathbf{y}_i \end{array} \right)$, we have
$cov(\mathbf{d}_i) = \boldsymbol{\Sigma} = \left( \begin{array}{c|c}
\boldsymbol{\Sigma}_x & \boldsymbol{\Sigma}_{xy} \\ \hline
\boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_y \end{array} \right)$, and
$\widehat{\boldsymbol{\Sigma}} = \left( \begin{array}{c|c}
\widehat{\boldsymbol{\Sigma}}_x & \widehat{\boldsymbol{\Sigma}}_{xy} \\ \hline
\widehat{\boldsymbol{\Sigma}}_{yx} & \widehat{\boldsymbol{\Sigma}}_y \end{array} \right)$.
\begin{enumerate}
\item Give the dimensions (number of rows and columns) of the following matrices: \\
$\mathbf{d}_i$, $\boldsymbol{\Sigma}$, $\boldsymbol{\Sigma}_{x}$, $\boldsymbol{\Sigma}_{y}$, $\boldsymbol{\Sigma}_{xy}$, $\boldsymbol{\Sigma}_{yx}$.
\item Write the parts of $\boldsymbol{\Sigma}$ in terms of the unknown parameter matrices.
\item Give a Method of Moments Estimator for $\boldsymbol{\Phi}$. Just write it down.
\item Obtain formulas for the Method of Moments Estimators of $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_0$ and $\boldsymbol{\Psi}$. Show your work. You may give $\widehat{\boldsymbol{\beta}}_0$ in terms of $\widehat{\boldsymbol{\beta}}_1$, but simplify $\widehat{\boldsymbol{\Psi}}$.
\item If the distributions of $\mathbf{x}_i$ and $\boldsymbol{\epsilon}_i$ are multivariate normal, how do you know that your Method of Moments estimates are also the MLEs?
\end{enumerate} % End of multivariate regression question