Final Assignment

STA441s16 Final Assignment

Please note that the Scab Disease data set has been replaced with the Tooth Growth data.

The final exam, like the quizzes, will have a computer part. Links to several data sets are given below. Your job is to get familiar with them and do appropriate analyses in preparation for the exam. On the exam, I will provide my SAS programs and results files. You will answer questions based on my input and output. What I do will be quite predictable, so even if you do not do exactly what I do, my input and output should be fairly easy to follow if you prepare. Again, you are not going to bring your SAS work to the final exam. You will answer questions based on my SAS work.

In addition to numerical answers and plain-language conclusions, you should be able to do the following for every analysis.

State the model. Mostly this involves copying/adapting regression equations of some sort from the formula sheet -- normal regression, logistic regression, or multinomial logistic regression. By "adapting," I mean the number of explanatory variables has to be right. For the covariance structure approach to within-cases, you should be ready to give the matrix Σ as well. The number of rows and columns has to be right.
For any test, be able to state the null hypothesis in Greek letters.
Locate any parameter estimates on the printout. For normal regression with independent errors, the estimate of σ² is MSE.
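As an illustration of the three tasks above, here is a sketch for a hypothetical normal regression with two explanatory variables. The notation is an assumption and may not match the formula sheet exactly; the point is only that the number of β terms matches the number of explanatory variables and that the null hypothesis is written in terms of the same symbols.

```latex
\begin{align*}
% Model: two explanatory variables, independent normal errors (hypothetical example)
Y_i &= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i,
  \quad \epsilon_i \stackrel{\text{ind}}{\sim} N(0, \sigma^2), \quad i = 1, \ldots, n \\
% Null hypothesis: neither explanatory variable is related to the response
H_0 &: \beta_1 = \beta_2 = 0 \\
% Estimate of the error variance, read from the printout
\widehat{\sigma}^2 &= \text{MSE}
\end{align*}
```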

Not all of the data sets below will appear on the final exam. There won't be time.

The TV data

The file TV1.data.txt contains data from a 1982 survey conducted in Stevens County in the United States. Well, actually Stevens County is fictitious, and the data were simulated using a program written by Ted Chang of the University of Virginia (see The American Statistician, 46 (1992), 232-237 for more information), but the details are realistic -- or anyway, they were realistic in 1982. The imaginary "Stevens County" is divided into 75 districts including rural, small-town and urban areas. For each of 500 households interviewed, the data file contains district number, household number within district, assessed value of home in US dollars (an indirect measure of income, which was not asked), and answers to 9 questions related to the respondents' interest in getting cable TV. The variables are:

District: 1-25 are rural, 26-50 small town, 51-75 city.
Household (numbered within district)
Assessed value of home in US dollars
Number of persons 12 and older in household
Number of persons 11 and younger in household
Number of TV sets in household
Price willing to pay for cable TV
Total TV hours watched last week (add hours for all persons in household)
Hours Public Affairs watched last week
Hours Sports watched last week
Hours Children's programming watched last week
Hours Movies watched last week

When you look at the data file, you will see that the columns with the 9 survey questions are numbered 1 through 9. My variable names are q1-q9. The primary response variable is q4: Price willing to pay for cable TV. I am going to make a variable called Location with three values: Rural, Small town, and City.

The Tooth Growth data

The Tooth Growth data are in the file ToothGrowth.data.txt. The response is the length of odontoblasts (cells responsible for tooth growth) in guinea pigs. There were 10 guinea pigs at each combination of three dose levels of Vitamin C (0.5, 1, and 2 mg/day) and two delivery methods (orange juice or ascorbic acid), for 60 animals in all.

The Donner Party data

The file donner.data.txt contains data from the ill-fated Donner party, a group of American pioneers who, in the mid-1800s, decided to attempt a new and untested route over the Sierra Nevada mountains. They were snowed in, and the legend is that the survivors were forced to resort to cannibalism. The data file supposedly contains three pieces of information for each adult (15 and over) in the party. I say supposedly because the historical record is not perfect, and there is even room for disagreement about what it meant to be a member of the Donner party, because some people split off from the party during the trek, rejoined later or not, and so on.

Anyway, the variables are

Age
Sex (0=M, 1=F)
Survival (0=No, 1=Yes)
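For practice stating a model with these variables, one possibility is a logistic regression of survival on age and sex. This is only a sketch: the symbols are assumptions, and whether the actual analysis includes other terms (an interaction, for example) is not stated here.

```latex
\begin{align*}
% pi_i is the probability that adult i survived (Survival = 1)
\log\!\left(\frac{\pi_i}{1 - \pi_i}\right)
  &= \beta_0 + \beta_1 \,\text{Age}_i + \beta_2 \,\text{Sex}_i,
  \qquad \pi_i = P(\text{Survival}_i = 1) \\
% Null hypothesis: controlling for age, sex is unrelated to survival
H_0 &: \beta_2 = 0
\end{align*}
```

With the coding above (0=M, 1=F), a positive β₂ would mean that the odds of survival are higher for women than for men of the same age.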

The Math data

These data were used extensively in lecture, starting with SAS Example 4. Please borrow my code for reading and cleaning up the data. That way we will have the same variable names and it will be possible for us to get exactly the same results. All I'm going to do is see whether the calculus course the students choose to take is predictable from their high school information. I'm going to use proc logistic, not proc catmod.
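When the response has more than two categories, the model is a multinomial logistic regression. The sketch below assumes, purely for illustration, three course categories (with the third as the reference category) and two high school explanatory variables x₁ and x₂. The actual number of categories and variables comes from the lecture code, so adapt the number of equations and β terms accordingly.

```latex
\begin{align*}
% One equation per non-reference category; pi_{i,j} = P(student i chooses course j)
\log\!\left(\frac{\pi_{i,1}}{\pi_{i,3}}\right)
  &= \beta_{0,1} + \beta_{1,1} x_{i,1} + \beta_{2,1} x_{i,2} \\
\log\!\left(\frac{\pi_{i,2}}{\pi_{i,3}}\right)
  &= \beta_{0,2} + \beta_{1,2} x_{i,1} + \beta_{2,2} x_{i,2} \\
% Null hypothesis: controlling for x_2, x_1 is unrelated to course choice
H_0 &: \beta_{1,1} = \beta_{1,2} = 0
\end{align*}
```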

Beat the Blues data

The file BeatTheBlues.data.txt contains data from a longitudinal clinical trial of an interactive, multimedia program known as "Beat the Blues" designed to deliver cognitive behavioural therapy to depressed patients via a computer terminal. Patients with depression recruited in primary care were randomised to either the Beat the Blues program, or to "Treatment as Usual" (TAU). The variables are

id: Patient identification code
drug: Did the patient take anti-depressant drugs? (No or Yes)
length: The length of the current episode of depression, a factor with values <6m (less than six months) and >6m (more than six months).
treatment: Treatment group, a factor with levels TAU (treatment as usual) and BtheB (Beat the Blues)
bdi_pre: Beck Depression Inventory score before treatment.
bdi_2m: Beck Depression Inventory score after two months
bdi_4m: Beck Depression Inventory score after four months
bdi_6m: Beck Depression Inventory score after six months
bdi_8m: Beck Depression Inventory score after eight months

This is a very rich data set. Start by looking at the data file and exploring it with simple descriptive statistics and elementary tests. How are the variables related to one another, two at a time? I found one thing I did not expect, and it's a bit disturbing.

Some people disappeared from the study before the final follow-up. That's another variable. Is disappearance at random, or is it related to other variables in the study? If disappearance is not at random, how might it bias the results? Think about it.

Once you've explored the data, do some analyses that try to answer the main research question.
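If the covariance structure approach to within-cases is used here, the matrix Σ needs the right number of rows and columns. As a sketch, assume the four follow-up scores (bdi_2m, bdi_4m, bdi_6m, bdi_8m) are the repeated measurements and the covariance structure is unstructured; both choices are assumptions for illustration. Treating bdi_pre as a fifth time point instead would make Σ a 5 by 5 matrix.

```latex
% Unstructured covariance matrix for four repeated measurements on the same patient.
% Symmetric: the (j,k) and (k,j) entries are the same covariance.
\boldsymbol{\Sigma} =
\begin{pmatrix}
\sigma_1^2  & \sigma_{12} & \sigma_{13} & \sigma_{14} \\
\sigma_{12} & \sigma_2^2  & \sigma_{23} & \sigma_{24} \\
\sigma_{13} & \sigma_{23} & \sigma_3^2  & \sigma_{34} \\
\sigma_{14} & \sigma_{24} & \sigma_{34} & \sigma_4^2
\end{pmatrix}
```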