STA442/1008 Final Exam Information and Assignment
This page is complete.
I have finished making up the final exam. The last two pieces of advice I can think of are these. First, be guided by this assignment. Apart from material associated with Chapter one, if it's not on here then don't worry about it. For example, notice that there is no sample size selection in this assignment. That means it would be unfair for me to ask about it on the final.
Finally, the most important thing in the course and on the final is interpretation. What do the results mean? You must be able to express conclusions in simple language that could be understood by someone with no training in Statistics.
The objective of statistical analysis is not numbers; it is understanding.
The final examination will be on Saturday Dec. 17th in SE2068, from 8-11 a.m. It will be closed book and closed notes. Any formulas you need (for example, the ones relating F to a) will be supplied. You will need a calculator; please bring one. For example, suppose you were asked to provide predicted Y values from proc reg output (hint). You could do this by hand, but you would not want to.
The final exam will be much like the quizzes, only longer. Links to several new datasets are given below, and you should analyze the data in preparation for the final. However, you will not bring your printouts to the exam. Instead, I will do the analyses and provide pieces of my program files and output, and you will answer questions based on my output. The idea is that if you have been working with the data already, you will be much faster and more focused on the exam, because you will have done most of the thinking already. Also, though I will provide my SAS programs on the final, there will not be a lot of background information about the datasets. It is expected that you will have become familiar with the data sets in the course of preparing for the final.
In response to a question, you will not necessarily see all these data sets on the final. But you will not see any SAS output that is not based on these datasets.
There are several data sets. You can see the Computer Hints page for instructions on copying them directly from my tuzo account to yours. This might be helpful for the longer data files.
Extra Office Hours:
And another thing: please make your examples somewhat believable. The idea here is not just to show me that you understand the meaning of the concepts, but also that you see how to use statistical methods as tools for answering real questions about the world. That actually is the whole point of this course. Well, if that does not convince you, maybe I should try a threat. If you have something like sex as an experimentally manipulated within-subjects factor by repeated sex change operations, you will lose marks.
Expect a set of true-false questions; you will need to get something like 8 or 9 out of 10 right in order to get any credit at all. To prepare, look at Assignment One, but also make up some true-false questions about material later in the course and try them out on your classmates, your children or your dog. For example, answer True or False:
There is one natural dependent variable and three natural independent variables. It's interesting to examine relationships between the DV and each IV controlling for the other two, but if you subdivide the data too much the analysis becomes complicated by small or zero cell frequencies. So, after some fooling around, I decided to limit the analysis for the final exam to the following questions:
We will not worry about expected frequencies less than 5, just about those less than 1.
The idea here is that one can obtain the density of a person's body by immersing him or her in liquid and measuring the volume of the liquid that is displaced, but this is quite inconvenient. We would much rather make some measurements with a scale and a measuring tape.
There are two estimates of percent body fat based on body density; they are based on Brozek's equation and Siri's equation, and they usually give very similar results. Average the two body fat estimates. This is the dependent variable.
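As a rough illustration of how the two estimates relate to body density, here is a minimal Python sketch (not SAS). It uses the commonly cited forms of Siri's and Brozek's equations; the function names and the density value are assumptions made for illustration and are not taken from the data file.

```python
def siri_percent_fat(density):
    """Siri's equation: percent body fat from body density in g/cm^3."""
    return 495.0 / density - 450.0


def brozek_percent_fat(density):
    """Brozek's equation: percent body fat from body density in g/cm^3."""
    return 457.0 / density - 414.2


def average_percent_fat(density):
    """Mean of the two estimates, which is the dependent variable here."""
    return (siri_percent_fat(density) + brozek_percent_fat(density)) / 2.0


# Illustrative density only (an assumption, not a value quoted from the file).
example_density = 1.0708
print(round(siri_percent_fat(example_density), 1))
print(round(brozek_percent_fat(example_density), 1))
print(round(average_percent_fat(example_density), 1))
```

Both estimates are deterministic functions of density, which is why they track each other so closely.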
Now develop an equation for predicting percent body fat from the other measurements in the file. There are several variables that you should not use as predictors; you should be able to figure out which ones they are.
I am going to do a stepwise regression, the one combining forward and backward steps. I will use the default settings. In my opinion, it is not so important that we have exactly the same numerical answers on this one. On the other hand, at each step you should be able to say what tests are being performed (state the full and reduced model for each one), and how the routine decides what to do next (enter a variable, delete a variable, or stop). The end result is a prediction equation. If I give you a set of measurements, you should be able to give me an (estimated) percent body fat.
I am interested in testing whether fuel efficiency is related to country of origin, first ignoring weight and length, and then controlling for weight and length. In each case, if the initial test is significant, I plan to follow up with all pairwise comparisons, Bonferroni corrected. When we're controlling for the covariates, the sample means no longer give us an adequate picture of what's going on. Why? Instead, we'll look at predicted Y (y-hat) values, with the independent variables set to their sample mean values. That's the sample mean for the entire data set, not just that group. Make a table; the rows are country. For each country, generate a predicted Y. Call the predicted Ys "corrected means." You will need a calculator, for sure. I am going to use cell means coding, so you should too. Of course there is no harm in doing it in another way too, as a check. For Japanese cars, my corrected mean is 7.63.
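The arithmetic behind a corrected mean can be sketched in a few lines of Python (not SAS). Under cell means coding there is no intercept, so each country has its own coefficient, and the corrected mean is that coefficient plus each covariate slope times the covariate's overall sample mean. Every number and label below is hypothetical and chosen only to make the arithmetic easy to follow; none of it comes from the car data or from any proc reg output.

```python
def corrected_mean(country_coefficient, slopes, covariate_means):
    """Predicted Y for one country with covariates at their overall means."""
    return country_coefficient + sum(
        slopes[name] * covariate_means[name] for name in slopes
    )


# HYPOTHETICAL regression coefficients (cell means coding, no intercept).
country_coefficients = {"Country A": 10.0, "Country B": 12.0}
slopes = {"weight": 0.002, "length": -0.5}

# HYPOTHETICAL sample means, taken over the entire data set.
covariate_means = {"weight": 1500.0, "length": 4.0}

corrected = {
    country: corrected_mean(b, slopes, covariate_means)
    for country, b in country_coefficients.items()
}
for country, value in corrected.items():
    print(country, round(value, 2))
```

Because the covariate terms are the same for every country, differences between corrected means equal differences between the country coefficients.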
I'll also test weight and length controlling for country -- why not? For all such questions, what's the full model, the reduced model? What, if anything, do the t-tests mean?
In plain language, what do you conclude from this analysis?
Farm Data: In an agricultural experiment, the cases are 10 farm fields. Five fields were randomly assigned to one irrigation method, and 5 were assigned to the other method. Then each field was subdivided into two plots of equal size. Within each field, one plot was randomly assigned to Fertilizer Type One, and the other plot was assigned to Fertilizer Type Two. The dependent variable is crop yield.
Of course the plots from a given field are quite similar in important respects like soil fertility and average moisture; they definitely should not be considered independent. We are interested in testing for
Here are the data:
--------------------------------------------------------------------------
Field                       1    2    3    4    5    6    7    8    9   10
--------------------------------------------------------------------------
Irrigation method           1    1    1    1    1    2    2    2    2    2
--------------------------------------------------------------------------
Yield with Fertilizer 1    43   40   31   27   36   63   52   45   47   54
Yield with Fertilizer 2    48   43   36   30   39   70   53   48   51   57
--------------------------------------------------------------------------

I'm going to do this with both proc glm and proc mixed. In proc mixed, I'm going to do it two ways: with the cs covariance structure and the un covariance structure. Compare the results!
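For a hand check of the farm data, here is a minimal Python sketch (not SAS) of the classical split-plot F statistics. It relies on the fact that, with only two fertilizer levels, each test reduces to a t-test on per-field totals or per-field differences, and F is t squared with 1 and 8 degrees of freedom. The helper names are made up for illustration, and the sketch assumes these classical statistics are what the univariate proc glm tests report. It says nothing about what the un covariance structure will give.

```python
from statistics import mean

# Yields copied from the table; fields 1-5 had irrigation method 1,
# fields 6-10 had irrigation method 2.
fert1 = [43, 40, 31, 27, 36, 63, 52, 45, 47, 54]
fert2 = [48, 43, 36, 30, 39, 70, 53, 48, 51, 57]
irrigation = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]


def split(values):
    """Separate per-field values into the two irrigation groups."""
    a = [v for v, m in zip(values, irrigation) if m == 1]
    b = [v for v, m in zip(values, irrigation) if m == 2]
    return a, b


def pooled_variance(a, b):
    """Pooled within-group variance of two independent samples."""
    ss = sum((x - mean(g)) ** 2 for g in (a, b) for x in g)
    return ss / (len(a) + len(b) - 2)


def group_difference_f(values):
    """F (t squared) for a difference between the irrigation group means."""
    a, b = split(values)
    se2 = pooled_variance(a, b) * (1 / len(a) + 1 / len(b))
    return (mean(a) - mean(b)) ** 2 / se2


def average_of_means_f(values):
    """F (t squared) for H0: the average of the two group means is zero."""
    a, b = split(values)
    se2 = pooled_variance(a, b) * (1 / len(a) + 1 / len(b)) / 4
    return ((mean(a) + mean(b)) / 2) ** 2 / se2


totals = [y1 + y2 for y1, y2 in zip(fert1, fert2)]  # between-fields part
diffs = [y2 - y1 for y1, y2 in zip(fert1, fert2)]   # within-fields part

f_irrigation = group_difference_f(totals)
f_fertilizer = average_of_means_f(diffs)
f_interaction = group_difference_f(diffs)
print(round(f_irrigation, 2), round(f_fertilizer, 2), round(f_interaction, 2))
```

Working with totals and differences makes it plain why the two plots in a field are never treated as independent observations: each field contributes one total and one difference.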
To check your work on this one, do a trial run with the compound symmetry covariance structure. Your F tests will be exactly the same as the "Univariate Tests of Hypotheses for Within Subject Effects" from your Assignment 11 proc glm output. The between subjects F will be identical too. Then edit your program to make type = un and re-run it.
             Wine
Judge    1    2    3    4
-------------------------
  1     20   24   28   28
  2     15   18   23   24
  3     18   19   24   23
  4     26   26   30   30
  5     22   24   28   26
  6     19   21   27   25
The cases are judges. We want to know if there is a significant difference in the rated quality of the 4 wines. If there is a difference, we want to know which wines are different from which other ones. I corrected an error in the data on Dec. 5th.
For these data, I am fairly comfortable with a compound symmetry assumption, which says essentially that ratings of different wines are correlated because each judge has a personal tendency to rate everything relatively a bit high or a bit low. Now, your Bonferroni-corrected pairwise comparisons should use the compound symmetry assumption too. This means that a bunch of ordinary matched t-tests are out, because the compound symmetry assumption implies that the population variance of each difference between means must be the same, and should be estimated using all the data -- not just the data used to calculate the sample difference between means. The easiest way to do it is with proc mixed. So, although you could get the overall F-test from either proc glm or proc mixed, I am going to stick to proc mixed, and you should too. I get an overall F = 57.5.
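As a check on the overall F, here is a minimal Python sketch (not SAS) of the classical sums-of-squares calculation with judges as blocks. For balanced data like these it gives the F quoted above. The pairwise t statistics use one pooled error variance from all the data, in the spirit of the compound symmetry argument. The variable names are made up for illustration, and the sketch does not compute p-values, so each t still has to be compared with a Bonferroni-adjusted critical value on 15 degrees of freedom.

```python
from itertools import combinations
from math import sqrt

# Rows are judges 1-6, columns are wines 1-4 (copied from the table).
ratings = [
    [20, 24, 28, 28],
    [15, 18, 23, 24],
    [18, 19, 24, 23],
    [26, 26, 30, 30],
    [22, 24, 28, 26],
    [19, 21, 27, 25],
]
n_judges = len(ratings)
n_wines = len(ratings[0])

grand_mean = sum(map(sum, ratings)) / (n_judges * n_wines)
wine_means = [sum(row[j] for row in ratings) / n_judges for j in range(n_wines)]
judge_means = [sum(row) / n_wines for row in ratings]

# Partition the total variation into wine, judge, and error pieces.
ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
ss_wine = n_judges * sum((m - grand_mean) ** 2 for m in wine_means)
ss_judge = n_wines * sum((m - grand_mean) ** 2 for m in judge_means)
ss_error = ss_total - ss_wine - ss_judge

df_error = (n_judges - 1) * (n_wines - 1)
ms_error = ss_error / df_error
f_wine = (ss_wine / (n_wines - 1)) / ms_error
print(round(f_wine, 2))  # 57.5

# Every pairwise difference shares one standard error based on all the data.
se_diff = sqrt(2 * ms_error / n_judges)
pairwise_t = {
    (a + 1, b + 1): (wine_means[a] - wine_means[b]) / se_diff
    for a, b in combinations(range(n_wines), 2)
}
for pair, t in pairwise_t.items():
    print(pair, round(t, 2))
```

An ordinary matched t-test for one pair would use only that pair's differences to estimate the variance, which is exactly what the pooled standard error avoids.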
Here are a few more miscellaneous hints and comments: