Applied Statistics II Course Plan

This course will consist of a collection of applied topics, tied together by a common theme: How should simulation data be analyzed? Statisticians frequently use simulation studies to investigate the behaviour of statistical methods under various circumstances. Simulations are empirical studies producing quantitative data. One would expect the analysis of such data to be a showcase for appropriate statistical methods, but usually this is far from being the case. We will try to do better.

The first major topic is choice of sample size, by traditional power analysis and other methods. In the case of simulation studies, this means Monte Carlo sample size. We'll look at the normal linear model in some detail, and then broaden the discussion. It is curious that power analysis is quite easy to understand, but it is often difficult to carry out in practice. We will identify the difficulties, and partially overcome them.
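As a rough illustration of what a traditional power analysis computes, here is a minimal sketch for the simplest possible case: a two-sided z-test of a mean with known standard deviation. It is not the normal linear model treated in the course, and it is written in Python rather than S purely for illustration. The function names and the standardized effect size (true mean divided by the standard deviation) are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided z-test of H0: mean = 0 with known variance.

    effect_size is the true mean in standard deviation units.
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    shift = effect_size * sqrt(n)     # mean of the test statistic under H1
    # Probability that the statistic falls in either rejection region.
    return z.cdf(-crit + shift) + z.cdf(-crit - shift)


def smallest_n(effect_size, target_power=0.80, alpha=0.05, n_max=1_000_000):
    """Smallest sample size whose power reaches target_power."""
    for n in range(1, n_max + 1):
        if z_test_power(effect_size, n, alpha) >= target_power:
            return n
    raise ValueError("target power not reached for n <= n_max")


if __name__ == "__main__":
    print(smallest_n(0.5))            # sample size for a half-SD effect
    print(z_test_power(0.5, 32))      # power at that sample size
```

The calculation itself is short. The hard part in practice is that the effect size must be specified in advance, which is one of the difficulties mentioned above.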

For some models, simulation can be a good (or even the only) way to conduct a power analysis. We will consider how best to do this. The first section will end with a discussion of how to estimate power from empirical data, including the question of why anybody would want to, and whether you should even try.
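To make the idea of simulation-based power analysis concrete, here is a minimal sketch under strong simplifying assumptions: normal data with known standard deviation 1 and a two-sided z-test, a case where the exact answer is available and simulation is not really needed. The function name, the default Monte Carlo sample size, and the seed are all hypothetical choices for this example. Again, Python stands in for S only for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_power(effect_size, n, n_sim=2000, alpha=0.05, seed=12345):
    """Estimate power by simulation; return (estimate, standard error).

    Assumes normal data with known standard deviation 1 and a
    two-sided z-test of H0: mean = 0. n_sim is the Monte Carlo
    sample size.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        sample = [rng.gauss(effect_size, 1.0) for _ in range(n)]
        z_stat = fmean(sample) * sqrt(n)   # known sigma = 1
        if abs(z_stat) > crit:
            rejections += 1
    p_hat = rejections / n_sim
    # The estimated power is a sample proportion, so it has a
    # binomial standard error that shrinks as n_sim grows.
    std_err = sqrt(p_hat * (1 - p_hat) / n_sim)
    return p_hat, std_err


if __name__ == "__main__":
    estimate, std_err = simulate_power(0.5, 32)
    print(estimate, std_err)
```

The estimated power is itself a statistic with sampling error, so the Monte Carlo sample size `n_sim` is a sample size to be chosen like any other.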

The second major topic is multiple comparisons, with emphasis on Scheffé tests and their generalizations, for example to generalized linear models. The idea is to have a family of tests, all jointly protected against Type I error at a single significance level. Our goal is to develop a Scheffé-like procedure that is appropriate for simulation studies. We will find that the power of such tests may not be all one could wish, but that the difficulties can be overcome with a sufficiently large Monte Carlo sample size. How large? Answering this question requires us to combine ideas from the first two topics.
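For reference, the classical Scheffé criterion in the simplest setting, a one-way layout with k group means, can be written as follows. The notation is an assumption made for this sketch and is not taken from the course handouts: N is the total sample size, n_j and \bar{Y}_j are the size and sample mean of group j, and MSE is the usual pooled error mean square.

```latex
% A contrast of the k group means and its estimate
\psi = \sum_{j=1}^{k} c_j \mu_j, \qquad \sum_{j=1}^{k} c_j = 0, \qquad
\hat{\psi} = \sum_{j=1}^{k} c_j \bar{Y}_j

% Scheffe test: reject H_0 : \psi = 0 when
\frac{\hat{\psi}^{2}}{\mathrm{MSE} \, \sum_{j=1}^{k} c_j^{2} / n_j}
  \; > \; (k-1) \, F_{\alpha;\, k-1,\, N-k}
```

Here F_{\alpha; k-1, N-k} is the upper \alpha critical value of the F distribution. Multiplying it by k-1 is what protects every possible contrast simultaneously, and it is also why each individual test gives up some power.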

At this point, the course may be over. If we have more time, we'll look at structural equation models, starting with factor analysis and regression with errors in the independent variables. I have a big simulation study of this latter topic. How should the data be analyzed?

The tools needed for this course are mathematical statistics at the Master's or advanced undergraduate level (a course in regression or linear models would be nice too, but it's not essential), and the S language running in a Unix environment. We may also use Mathematica a bit if I can get something to work. There is no textbook, but many pages of handouts will be available on the Web in PDF format.

We will start out with a little assignment designed to let me find out if you can control the kind of tools we need to use. In particular, can you do a couple of problems at the second-year undergraduate level and implement the results in S?

There will be an assignment every week or so. Sometimes it will be handed in, and sometimes there will be a short quiz in class, based on the assignment.