About the STA442/1008 Final Exam

STA442/1008 Final Exam Information and Assignment

This page is complete.

I have finished making up the final exam. The last two pieces of advice I can think of are these. First, be guided by this assignment. Apart from material associated with Chapter one, if it's not on here then don't worry about it. For example, notice that there is no sample size selection in this assignment. That means it would be unfair for me to ask about it on the final.

Finally, the most important thing in the course and on the final is interpretation. What do the results mean? You must be able to express conclusions in simple language that could be understood by someone with no training in Statistics.

The objective of statistical analysis is not numbers; it is understanding.

The final examination will be on Saturday Dec. 17th in SE2068, from 8-11 a.m. It will be closed book and closed notes. Any formulas you need (for example, the ones relating F to a) will be supplied. You will need a calculator; please bring one. For example, suppose you were asked to provide predicted Y values from proc reg output (hint). You could do this by hand, but you would not want to.

The final exam will be much like the quizzes, only longer. Links to several new datasets are given below, and you should analyze the data in preparation for the final. However, you will not bring your printouts to the exam. Instead, I will do the analyses and provide pieces of my program files and output, and you will answer questions based on my output. The idea is that if you have been working with the data already, you will be much faster and more focused on the exam, because you will have done most of the thinking already. Also, though I will provide my SAS programs on the final, there will not be a lot of background information about the datasets. It is expected that you will have become familiar with the data sets in the course of preparing for the final.

In response to a question, you will not necessarily see all these data sets on the final. But you will not see any SAS output that is not based on these datasets.

There are several data sets. You can see the Computer Hints page for instructions on copying them directly from my tuzo account to yours. This might be helpful for the longer data files.

Extra Office Hours:

Christine: Monday Dec 12, 1-3pm.
Jerry: Wednesday Dec 14 and Friday Dec 16, 10-11:30am

Please review Assignment One, and look at the class notes as well as at the online text. There will be at least one question of the "make up a study" variety. It will certainly involve techniques you encountered later in the course, and not just the elementary tests. For example,
- A three-factor analysis of covariance with two between-subjects factors and one within-subjects factor, and one quantitative covariate.
- A multiple regression with repeated measures.
And that sort of thing. Of course it will not be these exact examples.
And another thing: please make your examples somewhat believable. The idea here is not just to show me that you understand the meaning of the concepts, but also that you see how to use statistical methods as tools for answering real questions about the world. That actually is the whole point of this course. Well, if that does not convince you, maybe I should try a threat. If you have something like sex as an experimentally manipulated within-subjects factor by repeated sex change operations, you will lose marks.
Expect a set of true-false questions; you will need to get something like 8 or 9 out of 10 right in order to get any credit at all. To prepare, look at Assignment One, but also make up some true-false questions about material later in the course and try them out on your classmates, your children or your dog. For example, answer True or False:
- When you add an independent variable to a multiple regression, R²
- A matched t-test is an example of a repeated measures analysis of variance
- In the usual multivariate regression, each dependent variable has the same set of independent variables
- In effect coding, the last category is the reference category.
- Suppose you did 2 t-tests and an F-test with proc reg, and then a chi-square test with proc freq, on a different data set. You could protect all 4 tests against Type I Error using a Bonferroni correction, if you decided on the tests before looking at the data.
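The arithmetic of a Bonferroni correction is short enough to sketch. This is only an illustration: the joint 0.05 level and the four p-values are made up, and the function name is hypothetical. Each test is simply carried out at the joint level divided by the number of tests.

```python
def bonferroni_reject(p_values, joint_alpha=0.05):
    """Return the per-test significance level and a reject/don't-reject
    decision for each p-value, using a Bonferroni correction."""
    k = len(p_values)                   # number of tests being protected
    per_test_alpha = joint_alpha / k    # each test uses joint_alpha / k
    decisions = [p < per_test_alpha for p in p_values]
    return per_test_alpha, decisions


# Hypothetical p-values for four tests (not from any course data set).
p_values = [0.003, 0.020, 0.011, 0.048]
per_test_alpha, decisions = bonferroni_reject(p_values)
print(per_test_alpha)   # per-test level for 4 tests at joint 0.05
print(decisions)
```

With four tests the per-test level is 0.0125, so a p-value of 0.020 would be significant on its own but not after the correction.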
See the Titanic Data. First, explore the data set and become familiar with it. Be able to answer questions like these: Were there any female crew? How many third-class female children survived? How many second-class children perished?
There is one natural dependent variable and three natural independent variables. It's interesting to examine relationships between the DV and each IV controlling for the other two, but if you subdivide the data too much the analysis becomes complicated by small or zero cell frequencies. So, after some fooling around, I decided to limit the analysis for the final exam to the following questions:
1. For just adults, is survival related to class when you control for sex?
2. For just adults, is survival related to sex when you control for class?
3. For just children, is survival related to class when you ignore sex?
4. For just children, is survival related to sex when you ignore class?
We will not worry about expected frequencies less than 5, just about those less than 1.
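To see what checking the expected frequencies involves, here is a minimal sketch. The counts in `toy_table` are invented for illustration and are not the Titanic data, and both function names are hypothetical. Under independence, each expected frequency is (row total)(column total)/n.

```python
def expected_frequencies(table):
    """Expected cell counts under independence for a two-way table
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def cells_below(expected, cutoff=1.0):
    """(row, column) positions whose expected count is below the cutoff."""
    return [(i, j)
            for i, row in enumerate(expected)
            for j, e in enumerate(row)
            if e < cutoff]


# Invented counts, purely for illustration.
toy_table = [[10, 2],
             [30, 1]]
expected = expected_frequencies(toy_table)
small_cells = cells_below(expected)   # cells with expected frequency < 1
print(expected)
print(small_cells)
```

Here only the first row, second column falls below 1, so that is the cell to worry about under the rule above.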
See the Body Fat Data. The file starts with some explanatory material. You can make a copy of the file and strip off the part at the front to make a data file. You should start by correcting the errors described in the explanatory material. Use your best judgement.
The idea here is that one can obtain the density of a person's body by immersing him or her in liquid and measuring the volume of the liquid that is displaced, but this is quite inconvenient. We would much rather make some measurements with a scale and a measuring tape.
There are two estimates of percent body fat based on body density; they are based on Brozek's equation and Siri's equation, and they usually give very similar results. Average the two body fat estimates. This is the dependent variable.
Now develop an equation for predicting percent body fat from the other measurements in the file. There are several variables that you should not use as predictors; you should be able to figure out which ones they are.
I am going to do a stepwise regression, the one combining forward and backward steps. I will use the default settings. In my opinion, it is not so important that we have exactly the same numerical answers on this one. On the other hand, at each step you should be able to say what tests are being performed (state the full and reduced model for each one), and how the routine decides what to do next (enter a variable, delete a variable, or stop). The end result is a prediction equation. If I give you a set of measurements, you should be able to give me an (estimated) percent body fat.
Consider the Cars Data. It has length, weight, origin and fuel efficiency in kilometers per litre, for a sample of cars. The three origins are US=1, Japanese=2 and European=3. Presumably these refer to the location of the head office, not to where the car was manufactured.
I am interested in testing whether fuel efficiency is related to country of origin, first ignoring weight and length, and then controlling for weight and length. In each case, if the initial test is significant, I plan to follow up with all pairwise comparisons, Bonferroni corrected. When we're controlling for the covariates, the sample means no longer give us an adequate picture of what's going on: Why? Instead, we'll look at predicted Y (y-hat) values, with the independent variables set to their sample mean values. That's the sample mean for the entire data set, not just that group. Make a table; the rows are country. For each country, generate a predicted Y. Call the predicted Ys "corrected means." You will need a calculator, for sure. I am going to use cell means coding, so you should too. Of course there is no harm in doing it in another way too, as a check. For Japanese cars, my corrected mean is 7.63.
I'll also test weight and length controlling for country -- why not? For all such questions, what's the full model, the reduced model? What, if anything, do the t-tests mean?
In plain language, what do you conclude from this analysis?
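The "corrected means" calculation can be sketched as follows. With cell means coding there is no intercept and one indicator per country, so the predicted Y for a country is its own coefficient plus each covariate coefficient times the overall sample mean of that covariate. All the numbers below are made up and the function name is hypothetical; they will not reproduce the 7.63 above, which comes from the real regression output.

```python
def corrected_means(group_coefs, covariate_coefs, covariate_means):
    """Predicted Y for each group with every covariate set to its
    overall sample mean (cell means coding: no intercept)."""
    adjustment = sum(covariate_coefs[name] * covariate_means[name]
                     for name in covariate_coefs)
    return {group: b + adjustment for group, b in group_coefs.items()}


# Hypothetical regression coefficients and sample means, for illustration only.
group_coefs = {"US": 20.0, "Japanese": 21.0, "European": 20.5}
covariate_coefs = {"weight": -0.004, "length": -0.01}
covariate_means = {"weight": 1500.0, "length": 450.0}

table = corrected_means(group_coefs, covariate_coefs, covariate_means)
for country, y_hat in table.items():
    print(country, round(y_hat, 2))
```

Because every country gets the same covariate adjustment, differences between corrected means are just differences between the country coefficients.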
Exercise tolerance data: Subjects aged 25-35 were classified according to three factors.
1. Sex: 1=Male 2=Female
2. Body Fat: 1=Low 2=High
3. Smoking history: 1=Light 2=Heavy
Exercise tolerance was measured in minutes until fatigue occurs while pedaling a stationary exercise bicycle. What should you do with these data? Hummm ....
Farm Data: In an agricultural experiment, the cases are 10 farm fields. Five fields were randomly assigned to one irrigation method, and 5 were assigned to the other method. Then each field was subdivided into two plots of equal size. Within each field, one plot was randomly assigned to Fertilizer Type One, and the other plot was assigned to Fertilizer Type Two. The dependent variable is crop yield.
Of course the plots from a given field are quite similar in important respects like soil fertility and average moisture; they definitely should not be considered independent. We are interested in testing for
- Main effect of irrigation method
- Main effect of fertilizer type
- Whether the effect of irrigation method depends on fertilizer type
- Whether the effect of fertilizer type depends on irrigation method
Here are the data:
```
---------------------------------------------------------------------
Field                       1   2   3   4   5       6   7   8   9  10
---------------------------------------------------------------------
Irrigation method           1   1   1   1   1       2   2   2   2   2
---------------------------------------------------------------------
Yield with Fertilizer 1    43  40  31  27  36      63  52  45  47  54
Yield with Fertilizer 2    48  43  36  30  39      70  53  48  51  57
---------------------------------------------------------------------
```
I'm going to do this with both proc glm and proc mixed. In proc mixed, I'm going to do it two ways: with the cs covariance structure and the un covariance structure. Compare the results!
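Before running anything, it may help to look at the data the way a repeated measures analysis does. The sketch below is descriptive only and is not a substitute for the SAS runs; the variable and function names are hypothetical. Within-field differences speak to the fertilizer effect (and, compared across irrigation methods, to the interaction), while within-field averages speak to the irrigation effect.

```python
from statistics import mean

# Data from the table above, one entry per field (fields 1-10).
irrigation = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
fert1 = [43, 40, 31, 27, 36, 63, 52, 45, 47, 54]
fert2 = [48, 43, 36, 30, 39, 70, 53, 48, 51, 57]

# Fertilizer 2 minus Fertilizer 1 within each field, and each field's average.
diffs = [b - a for a, b in zip(fert1, fert2)]
field_means = [(a + b) / 2 for a, b in zip(fert1, fert2)]


def mean_by_method(values):
    """Average of per-field values within each irrigation method."""
    return {m: mean(v for v, g in zip(values, irrigation) if g == m)
            for m in sorted(set(irrigation))}


diff_means = mean_by_method(diffs)          # fertilizer difference by method
yield_means = mean_by_method(field_means)   # average yield by method
print(diffs)
print(diff_means)
print(yield_means)
```

Every field yields more with Fertilizer 2, and the average difference is nearly the same under both irrigation methods (3.8 versus 3.6), which is worth keeping in mind when you read the interaction test.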
Recall the shoe data from Assignment 11. Please test both main effects and the interaction using proc mixed. Then follow up the main effect for period with all pairwise comparisons of marginal means. Using a Bonferroni correction, what (if anything) do you conclude?
To check your work on this one, do a trial run with the compound symmetry covariance structure. Your F tests will be exactly the same as the "Univariate Tests of Hypotheses for Within Subject Effects" from your Assignment 11 proc glm output. The between subjects F will be identical too. Then edit your program to make type = un and re-run it.
In the wine study, 6 judges rated 4 wines on a scale from 0 to 40, presented in random order and with a double blind. Here are the data:
```
              Wine
Judge      1   2   3   4
-------------------------
  1       20  24  28  28
  2       15  18  23  24
  3       18  19  24  23
  4       26  26  30  30
  5       22  24  28  26 
  6       19  21  27  25
```
The cases are judges. We want to know if there is a significant difference in the rated quality of the 4 wines. If there is a difference, we want to know which wines are different from which other ones. I corrected an error in the data on Dec. 5th.
For these data, I am fairly comfortable with a compound symmetry assumption, which says essentially that ratings of different wines are correlated because each judge has a personal tendency to rate everything relatively a bit high or a bit low. Now, your Bonferroni-corrected pairwise comparisons should use the compound symmetry assumption too. This means that a bunch of ordinary matched t-tests are out, because the compound symmetry assumption implies that the population variance of each difference between means must be the same, and should be estimated using all the data -- not just the data used to calculate the sample difference between means. The easiest way to do it is with proc mixed. So, although you could get the overall F-test from either proc glm or proc mixed, I am going to stick to proc mixed, and you should too. I get an overall F = 57.5.
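As a hand check on the overall test, the classical univariate repeated measures F (wine mean square over the judge-by-wine error mean square) can be computed directly from the table. This is a sketch in Python rather than SAS, with hypothetical variable names. For balanced data like these it should agree with the compound symmetry proc mixed result, as in the shoe data check above, and it does reproduce F = 57.5.

```python
# Ratings from the table above: one row per judge, one column per wine.
ratings = [
    [20, 24, 28, 28],
    [15, 18, 23, 24],
    [18, 19, 24, 23],
    [26, 26, 30, 30],
    [22, 24, 28, 26],
    [19, 21, 27, 25],
]
n_judges = len(ratings)
n_wines = len(ratings[0])

grand_mean = sum(map(sum, ratings)) / (n_judges * n_wines)
wine_means = [sum(col) / n_judges for col in zip(*ratings)]
judge_means = [sum(row) / n_wines for row in ratings]

# Sums of squares: wines, judges, total, and what is left over (error).
ss_wine = n_judges * sum((m - grand_mean) ** 2 for m in wine_means)
ss_judge = n_wines * sum((m - grand_mean) ** 2 for m in judge_means)
ss_total = sum((y - grand_mean) ** 2 for row in ratings for y in row)
ss_error = ss_total - ss_wine - ss_judge

df_wine = n_wines - 1                       # 3
df_error = (n_judges - 1) * (n_wines - 1)   # 15
f_wine = (ss_wine / df_wine) / (ss_error / df_error)

print(wine_means)
print(round(f_wine, 2))   # 57.5
```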

Here are a few more miscellaneous hints and comments:

Any time there is a univariate F or t-test, there is the opportunity to ask how much of the remaining variation the effect explains. I will certainly ask this at least a couple of times. Formulas will be provided.
For any proc mixed run with repeated measures, there is an interesting test for correlation among the observations coming from the same subject. Can you find it in your output?
Vocabulary is important. Know the buzzwords. They are a useful shorthand for some of the main ideas in the course.
But be able to turn the vocabulary off at will. If you cannot express a technical idea in plain English, then you don't really understand it. This principle goes far beyond Statistics. So, when you are asked to state conclusions in plain language, beware of using statistical vocabulary and above all say what happened. Was there higher crop yield with Fertilizer B? Say it!
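On the first hint above (how much of the remaining variation an effect explains), here is a minimal sketch of the calculation. The notation is an assumption and may differ from the formula sheet: s is the numerator degrees of freedom of the F-test, and error_df is the denominator degrees of freedom (n - p for a regression with p betas in the full model). Since a/(1-a) = sF/(n-p), the proportion is a = sF/(n-p+sF). The function names are hypothetical. The example uses the wine study's F = 57.5, taking 3 and 15 degrees of freedom as an assumption.

```python
def proportion_remaining_variation(f_stat, s, error_df):
    """Proportion of remaining variation explained by the effect:
    a = s*F / (error_df + s*F)."""
    return s * f_stat / (error_df + s * f_stat)


def f_from_proportion(a, s, error_df):
    """Inverse relationship: F = (error_df / s) * a / (1 - a)."""
    return (error_df / s) * a / (1 - a)


# Wine study: F = 57.5, assuming 3 numerator and 15 denominator df.
a = proportion_remaining_variation(57.5, s=3, error_df=15)
print(round(a, 2))                                     # 0.92
print(round(f_from_proportion(a, s=3, error_df=15), 2))  # 57.5
```

For a t-test, use F = t squared with s = 1.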