STA441s16 Final Assignment


Please note that the Scab Disease data set has been replaced with the Tooth Growth data.

The final exam, like the quizzes, will have a computer part. Links to several data sets are given below. Your job is to get familiar with them and do appropriate analyses in preparation for the exam. On the exam, I will provide my SAS programs and results files. You will answer questions based on my input and output. What I do will be quite predictable, so even if you do not do exactly what I do, my input and output should be fairly easy to follow if you prepare. Again, you are not going to bring your SAS work to the final exam. You will answer questions based on my SAS work.

In addition to numerical answers and plain-language conclusions, you should be able to do the following for every analysis.

Not all of the data sets below will appear on the final exam. There won't be time.

The TV data

The file TV1.data.txt contains data from a 1982 survey conducted in Stevens County in the United States. Well, actually Stevens County is fictitious, and the data were simulated using a program written by Ted Chang of the University of Virginia (see The American Statistician, 46 (1992), 232-237 for more information), but the details are realistic -- or anyway, they were realistic in 1982. The imaginary "Stevens County" is divided into 75 districts including rural, small-town and urban areas. For each of 500 households interviewed, the data file contains district number, household number within district, assessed value of home in US dollars (an indirect measure of income, which was not asked), and answers to 9 questions related to the respondents' interest in getting cable TV. The variables are:

  1. District: 1-25 are rural, 26-50 small town, 51-75 city.
  2. Household (numbered within district)
  3. Assessed value of home in US dollars
  4. Number of persons 12 and older in household
  5. Number of persons 11 and younger in household
  6. Number of TV sets in household
  7. Price willing to pay for cable TV
  8. Total TV hours watched last week (add hours for all persons in household)
  9. Hours Public Affairs watched last week
  10. Hours Sports watched last week
  11. Hours Children's programming watched last week
  12. Hours Movies watched last week

When you look at the data file, you will see that the columns with the 9 survey questions (items 4 through 12 in the list above) are numbered 1 through 9. My variable names are q1-q9. The primary response variable is q4: Price willing to pay for cable TV. I am going to make a variable called Location with three values: Rural, Small town, and City.
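The Location recode follows directly from the district ranges in the variable list. The course work is done in SAS, so the following is only a minimal Python sketch of the rule, not my program. The names `district`, `household` and `value` for the first three columns are assumptions made for illustration; only q1-q9 come from the text above.

```python
def location_from_district(district: int) -> str:
    """Map a district number (1-75) to the three-level Location variable."""
    if 1 <= district <= 25:
        return "Rural"
    if 26 <= district <= 50:
        return "Small town"
    if 51 <= district <= 75:
        return "City"
    # Anything else does not match the documented district ranges.
    raise ValueError(f"district out of range 1-75: {district}")


# Assumed column names: the first three are hypothetical labels,
# q1-q9 are the nine survey questions (items 4 through 12 in the list).
COLUMN_NAMES = ["district", "household", "value"] + [f"q{i}" for i in range(1, 10)]

# The primary response, price willing to pay, is item 7 in the list, i.e. q4.
PRIMARY_RESPONSE = COLUMN_NAMES[7 - 1]
```

Raising an error for a district outside 1-75 is a design choice in this sketch: a silent fourth category would hide a data-entry problem.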

The Tooth Growth data

The Tooth Growth data are in the file ToothGrowth.data.txt. The response is the length of odontoblasts (cells responsible for tooth growth) in guinea pigs. There are 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg per day) with each of two delivery methods (orange juice or ascorbic acid), for 60 animals in all.
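The layout described above is a balanced two-by-three factorial design. As a quick check on the bookkeeping, here is a small Python sketch (the course itself uses SAS) that enumerates the cells and works out the degrees of freedom for a two-way analysis of variance. Treating dose as a categorical factor with three levels is an assumption of this sketch; other analyses are possible.

```python
from itertools import product

DOSES_MG = (0.5, 1.0, 2.0)                  # three dose levels of Vitamin C
METHODS = ("orange juice", "ascorbic acid")  # two delivery methods
PIGS_PER_CELL = 10                           # guinea pigs per treatment combination

# Every combination of delivery method and dose is one cell of the design.
cells = list(product(METHODS, DOSES_MG))
total_animals = len(cells) * PIGS_PER_CELL

# Degrees of freedom for a two-way ANOVA with interaction,
# treating dose as categorical (an assumption of this sketch).
df_method = len(METHODS) - 1
df_dose = len(DOSES_MG) - 1
df_interaction = df_method * df_dose
df_error = total_animals - len(cells)
df_total = total_animals - 1

print(len(cells), total_animals)                      # 6 60
print(df_method, df_dose, df_interaction, df_error)   # 1 2 2 54
```

The four components add up to the total of 59, which is a handy check when reading an ANOVA summary table.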

The Donner Party data

The file donner.data.txt contains data from the ill-fated Donner party, a group of American pioneers who, in the mid-1800s, decided to attempt a new and untested route over the Sierra Nevada mountains. They were snowed in, and the legend is that the survivors were forced to resort to cannibalism. The data file supposedly contains three pieces of information about each adult (15 and over) in the party. I say supposedly because the historical record is not perfect, and there is even room for disagreement about what it meant to be a member of the Donner party, because some people split off from the party during the trek, rejoined later or not, and so on.

Anyway, the variables are

The Math data

These data were used extensively in lecture, starting with SAS Example 4. Please borrow my code for reading and cleaning up the data. That way we will have the same variable names and it will be possible for us to get exactly the same results. All I'm going to do is see if the calculus course the students choose to take is predictable from their high school information. I'm going to use proc logistic, not proc catmod.

Beat the Blues data

The file BeatTheBlues.data.txt contains data from a longitudinal clinical trial of an interactive, multimedia program known as "Beat the Blues" designed to deliver cognitive behavioural therapy to depressed patients via a computer terminal. Patients with depression recruited in primary care were randomised to either the Beating the Blues program, or to "Treatment as Usual" (TAU). The variables are

This is a very rich data set. Start by looking at the data file and exploring it with simple descriptive statistics and elementary tests. How are the variables related to one another, two at a time? I found one thing I did not expect, and it's a bit disturbing.

Some people disappeared. That's another variable. Is disappearance at random, or is it related to other variables in the study? If disappearance is not at random, how might it bias the results? Think about it.
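One way to start thinking about this is to turn disappearance into a 0/1 variable and compare its rate across groups. The Python sketch below (the course itself uses SAS) does that for a list of records. The field names `treatment` and `followup` and the toy rows are made up purely for illustration; they are not taken from the data file.

```python
from collections import defaultdict


def dropout_rate_by_group(records, group_key, followup_key):
    """Return the proportion of records with a missing follow-up value, per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [number missing, number total]
    for rec in records:
        missing = rec[followup_key] is None  # None marks a person who disappeared
        counts[rec[group_key]][0] += int(missing)
        counts[rec[group_key]][1] += 1
    return {group: n_missing / n_total
            for group, (n_missing, n_total) in counts.items()}


# Made-up toy rows with hypothetical field names, for illustration only.
toy = [
    {"treatment": "A", "followup": 12.0},
    {"treatment": "A", "followup": None},
    {"treatment": "B", "followup": 7.0},
    {"treatment": "B", "followup": 9.0},
]
rates = dropout_rate_by_group(toy, "treatment", "followup")
print(rates)  # {'A': 0.5, 'B': 0.0}
```

If the rates differ noticeably between groups, or the missing-data indicator is related to baseline variables, that is evidence that disappearance is not at random. A formal version of the comparison would be a test of independence between the indicator and the grouping variable.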

Once you've explored the data, do some analyses that try to answer the main research question.