STA 999f 2099 Home Page

STA 378H5: Research Project

University of Toronto Mississauga, Fall 2016

Assignments
1. Assignment 1: Due at our first meeting
  1. This was in my email of April 26th. Download and install LaTeX on the computer you'll be using. To make sure it's working properly, take the code for Assignment 1 from the STA 431 website and run it on your computer. You don't have to understand the code yet; I can explain it when we meet.
  2. We will start with this familiar example. It's a simple regression with one latent explanatory variable X and one latent response variable Y. There's double measurement on both variables, and the measurements are fully equivalent. That is, the variances and covariances of e1 and e2 are equal to those of e3 and e4.
    1. Make a path diagram.
    2. Write the model equations in scalar form.
    3. Calculate Σ. Use the order of variables W₁, W₂, V₁, V₂. This turns out to be easier to look at and work with.
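A hand calculation of Σ can be checked numerically for particular parameter values. The following is a minimal Python sketch, not code from the course. The model equations in the comments, the parameter names (`beta`, `phi`, `psi`, `omega11`, `omega12`, `omega22`), the function names and the numerical values are all assumptions made for illustration; adjust them to match the model as stated on the STA 431 website.

```python
# Sketch: numerical covariance matrix for a double-measurement regression model.
# ASSUMED model equations (not taken from the course materials):
#   Y  = beta*X + epsilon
#   W1 = X + e1,  W2 = X + e2
#   V1 = Y + e3,  V2 = Y + e4
# with Var(X) = phi, Var(epsilon) = psi, Var(e1) = Var(e3) = omega11,
# Var(e2) = Var(e4) = omega22, Cov(e1,e2) = Cov(e3,e4) = omega12,
# and all other pairs of exogenous variables independent.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def model_sigma(beta, phi, psi, omega11, omega12, omega22):
    """Covariance matrix of (W1, W2, V1, V2): Lambda * Phi * Lambda' + Omega."""
    # Covariance matrix of the latent vector (X, Y).
    latent_cov = [[phi, beta * phi],
                  [beta * phi, beta ** 2 * phi + psi]]
    # Factor loadings: W1 and W2 measure X; V1 and V2 measure Y.
    loadings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    # Measurement error covariance: the same 2x2 block for both pairs.
    error_cov = [[omega11, omega12, 0.0, 0.0],
                 [omega12, omega22, 0.0, 0.0],
                 [0.0, 0.0, omega11, omega12],
                 [0.0, 0.0, omega12, omega22]]
    common = matmul(matmul(loadings, latent_cov), transpose(loadings))
    return [[common[i][j] + error_cov[i][j] for j in range(4)]
            for i in range(4)]

# Hypothetical parameter values, chosen only for illustration.
sigma = model_sigma(beta=0.5, phi=2.0, psi=1.0,
                    omega11=1.0, omega12=0.25, omega22=1.5)
for row in sigma:
    print(row)
```

Substituting the same numbers into your symbolic Σ should reproduce the printed matrix entry by entry, provided your model equations agree with the assumed ones.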
2. Assignment 2
  1. Download and install Sage. The last 3 pages of Appendix B should be helpful.
  2. Check out Testing Structural Equation Models by Kenneth Bollen and J. Scott Long from the library. It's at Robarts, and you will need your student ID card. An electronic copy would be even better, but I cannot find one. Please read the introduction and think about it.
  3. In LaTeX, make a list of the measures of model fit that proc calis produces by default. This is just a list of names. You will eliminate some of them and add to the document later. Print the list and bring the single sheet of paper to our meeting. Use an itemize environment, like this:
```
\begin{itemize}
    \item 
    \item 
    \item 
\end{itemize}
```
  4. We need to know what these fit measures are. In particular, we need to know how people use them to decide whether the model fits the data well enough to proceed with estimation and testing. For starters, just locate
    - A formula for computing the index from sample data.
    - The cutoff value for deciding whether the model is okay. There may be more than one suggestion.
    Do this for as many of the indices as you can, and report back. A good place to start is the proc calis manual. A copy is posted on the STA431 website. Wikipedia might be useful too, and I will also give you some textbooks in pdf format.
3. Assignment 3: For Wed. May 24
  1. Read Appendix B about Sage, pages 207-267. The rest of the appendix is documentation of the individual functions in the sem package. It's not done yet, but in the Sage environment, if you type the name of the function followed by a question mark, there's a decent help file. The part I have not finished yet is typesetting these help files (maybe enhanced a bit) and putting them in the textbook.
  2. Get your Sage notebook interface working so you can see Greek letters.
  3. Look through some of the pdf textbooks, including Bollen, for (i) how they suggest assessing model fit, and (ii) specifically, suggested criteria for RMSEA. Please stick to the textbooks, avoiding Google search for now. The reason is that the textbooks have context; if you don't know what they are talking about you can find it earlier in the book. Be ready to report back.
4. Assignment 4: For Wed. May 31
  1. Get that pointer integration (or whatever it's called) figured out so you can copy-paste between SageMath and Windows. This is important.
  2. Try to set up the shared folder between Windows and the virtual machine. By "try" I mean give it a try and if it doesn't work don't worry about it. Don't spend too much time on this.
  3. Produce another typeset document. This time it's in a 12 point font size rather than 10.
    - Start out by saying "The general linear structural equation model in centered form is ..." and then lift material from the STA431 formula sheet.
    - Then say "The following indices are among those proposed as alternatives to the chi-squared test for assessing model fit." Then make an itemize list with RMSEA as the only item. We'll fill in others later. Using the proc calis manual (that's a good source), give the full definition, but use our notation and assume only one group (so that's k=1 in the notation of the manual). This should be pretty complete, so we don't have to look up RMSEA any more. Include as many suggestions as you can find for what's a good value. Say who said it and write down the exact source so we can find it again, but you don't need to include a bibliography yet. For example, p. 144 of Testing Structural Equation Models has some more suggestions.
    - Print hard copy and bring it to our meeting.
  4. Play with n and corr in SimpleDoubleSim1A.sas. Try to locate values where the chi-squared statistic is significant, the test of H₀: β₁=0 is significant, and RMSEA (Root Mean Square Error of Approximation) is small -- under 0.09 for sure, under 0.05 if possible.
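To get a feel for how the sample size drives RMSEA, here is a minimal Python sketch of the usual single-group point estimate, the square root of max((χ² − df)/(df(n − 1)), 0). The function name `rmsea`, the use of n − 1 rather than n as the multiplier, and the numbers in the usage lines are assumptions for illustration; check the exact definition against the proc calis manual.

```python
import math

def rmsea(chi_square, df, n):
    """Single-group RMSEA point estimate from a chi-squared fit statistic.

    Assumes chi_square = (n - 1) * F_min, so that
    RMSEA = sqrt(max(F_min/df - 1/(n - 1), 0)).
    """
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    return math.sqrt(max((chi_square - df) / (df * (n - 1)), 0.0))

# Hypothetical values: a chi-squared statistic of 20 on 2 degrees of freedom
# is well past the 0.05 critical value of about 5.99, yet the RMSEA shrinks
# as n grows because of the division by n - 1.
for n in (201, 2001, 5001):
    print(n, round(rmsea(20.0, 2, n), 4))
```

Holding the chi-squared statistic fixed while n changes is artificial, since under a mis-specified model the statistic itself grows with n. The loop is only meant to show the role of n − 1 in the denominator.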
5. Assignment 5: For Wed. June 7th
  1. Continue trying to find a way to print from Sage. Saving a pdf to the shared folder is a good way, if you can find the shared folder in the Linux environment. Another way is to connect your Windows browser to localhost, something we had trouble doing at our most recent meeting. If you are not using Firefox, try that. Another thing to try, regardless of the browser, is connecting to http://localhost:8080/home/admin rather than just http://localhost:8080.
  2. Play with n and corr in SimpleDoubleSim1A.sas. Try to locate values where the chi-squared statistic is significant, the test of H₀: β₁=0 is significant, and RMSEA (Root Mean Square Error of Approximation) is small -- under 0.09 for sure, under 0.05 if possible. That is, we're looking for the situation where RMSEA gives us the green light but the chi-squared test detects lack of fit and, sure enough, the test for X → Y is significant (Type I error).
  3. This question is paper and pencil. It is setting up another way to obtain the large-sample target of the restricted MLE. Bring your answers to the meeting. Let Y ~ P_θ. The statistic T = T(Y) is said to be sufficient for θ if the conditional distribution of Y given T does not depend on θ. This means that if you're interested in θ, T is all you need to know; it's enough (sufficient).
    1. State the Neyman factorization theorem. It's okay to copy it from a textbook.
    2. Give a sufficient statistic for (μ,Σ) based on a random sample from a multivariate normal distribution.
    3. Give a sufficient statistic for (μ,σ²) based on a random sample from a univariate normal distribution.
    4. Give another sufficient statistic for the parameters of a univariate normal distribution based on a random sample, showing that sufficient statistics are not unique.
    5. Let X₁, ..., X_n be a random sample from a distribution that is uniform on (0,θ). Find a sufficient statistic for θ.
    6. Show that the MLE depends on the sample data only through the value of a sufficient statistic. Hint: differentiate.
    7. Show that the posterior distribution depends on the sample data only through the value of a sufficient statistic. You may assume that all the distributions are continuous. Hint: integrate.
    8. We would like to say that the MLE is a function of a sufficient statistic. Why must the MLE be unique in order for this to be true?
    9. Suppose that a vector of sample moments M_n is a sufficient statistic for θ, as in the normal distribution, and that the (unique) MLE is not just a function of M_n, but a continuous function g(M_n). This is certainly the case in the toy example we have been studying, but notice that nobody said we have an explicit formula for g, just that it's continuous. Under this assumption, what is the large-sample target of the MLE?
    10. Remember how you can give proc calis just the sample covariance matrix and the sample size rather than the raw data? This is a common feature in structural equation modeling software, and it has led to the following proposal. If you want to know the large-sample target of the restricted MLE for a particular set of numerical parameter values, calculate Σ and, pretending that it's Σ̂, give that matrix to your software along with some sample size (your choice) and fit the restricted model. Your MLEs will be the large-sample targets, supposedly. Here's the question: what is the connection of this idea to the previous question (item 9)?
    11. How could we use this idea to obtain the non-centrality parameter for the goodness of fit test directly from the SAS output?
    12. If you have any comments or questions, please write them down so we can discuss.
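As an illustration of the definition of sufficiency used in the paper and pencil question of Assignment 5, here is a standard worked example with Bernoulli data. It is not one of the assigned questions, and the notation is an assumption chosen to match the definition given there. The conditional probability is calculated directly and turns out to be free of θ.

```latex
% Illustration only: Bernoulli data, not one of the assigned questions.
% Requires the amsmath package for align* and \binom.
Let $Y_1, \ldots, Y_n$ be a random sample from a Bernoulli($\theta$)
distribution with $0 < \theta < 1$, and let $T = \sum_{i=1}^n Y_i$.
For any $y = (y_1, \ldots, y_n)$ with $\sum_{i=1}^n y_i = t$,
\begin{align*}
P_\theta(Y = y \mid T = t)
  &= \frac{P_\theta(Y = y)}{P_\theta(T = t)} \\
  &= \frac{\theta^t (1-\theta)^{n-t}}
          {\binom{n}{t} \theta^t (1-\theta)^{n-t}} \\
  &= \frac{1}{\binom{n}{t}},
\end{align*}
which does not depend on $\theta$. So $T$ is sufficient for $\theta$.
```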
6. Assignment 6: For Wed. June 14th. Assuming the mis-specified model with ω₁₂=0 and also assuming the null hypothesis β=0, use Sage to obtain
  1. Explicit formulas for the restricted MLEs.
  2. The large-sample targets of these restricted MLEs under the true model (you may not get to this).
  Bring the printout to the meeting. Though I do not promise to read my email promptly, please feel free to email me with any questions. Use my code wherever possible.
7. Assignment 7: For Wed. June 21st. Consider the likelihood ratio test of H₀: β=0 under the mis-specified model in which ω₁₂=0. We are interested in what happens to the test statistic as n → ∞.
  1. We have an expression for the (minus) log likelihood, and we have expressions for the MLE under both the full mis-specified model and the restricted mis-specified model. Using Sage, obtain an explicit formula for the test statistic in terms of n and s_ij quantities, simplified as much as possible.
  2. Unless I am seriously confused, the answer should be the sample size n multiplied by a function of the s_ij sample variances and covariances. Set n aside and using Sage, find the large-sample target of that second part under the true model. The answer is in terms of the Greek-letter parameters. Simplify. Note that at this point, we still have
Readings
1. Appendix A: Review and background
2. Appendix B: Sage
Code
1. SimpleDoubleSim1A.sas
2. SimpleDouble1.sage.txt: Explicit formulas for the MLEs under the mis-specified model, large-sample targets of the (incorrect) MLEs under the true model, and large-sample target of (incorrect) f_min under the true model.
3. PlotPoweR.txt: R code for plotting power functions of the chi-squared test of fit.