STA442/2101 Final Exam Information


Time and Location

The final exam will be on Friday, Dec. 12th from 2 to 5 p.m. in GB405.

Jerry's Office Hours for the Final

Office hours will be in Bissell 114, the same basement location we have had for most of the term.

Here are my answers to Quizzes 1-10, except for the R code. From this point on, no further changes to the marking of Quizzes 1-10 will be considered. I will post the solution to Quiz 11 once you've had a chance to get your quiz back and look it over.

Format

You will write your answers on the question paper. The exam will be closed book and closed notes. You should bring a calculator with natural log and exponential functions. (Any kind is acceptable unless it has communications capability.) Pencil is okay.

The current (and final) formula sheet will be supplied with the exam. It is a very good idea to be familiar with it, so you can find what you need easily.

There are 8 questions, occupying 15 pages. Some of the pages consist of R output. Most of the questions have more than one part. The questions are not equally difficult, and not equally time-consuming. The questions on assignments and quizzes are a good indication of what to expect.

Coverage

The final exam is cumulative. What you are supposed to be able to do is indicated by the assignments, including the R part of this one. The lecture overheads are intended to help you understand how to answer questions like the ones in the assignments. The textbooks are de-emphasized. My feeling is that this exam could be much harder than it is. A pretty good student should be able to get a very good mark.

Not all parts of the course are equally represented. Early versions of the exam were much too long. The exam is quite predictable, but some things you would expect to see do not appear because they were cut out.

R

You will not be asked to write any R code on the final. There will be four data sets; they are described below. To prepare for the final exam, familiarize yourself with the data sets and analyze them using methods from the course. Try different dummy variable coding schemes. Draw conclusions, and be ready to state your findings in plain, non-statistical language. You will not bring your printouts to the exam. Instead, you will get a copy of my printouts, and will answer questions based on them. The idea is that even if you do not do things exactly the same way I do, you will understand the output a lot better and faster if you have done it yourself. Note that I may or may not center the data for some analyses.
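As a practice sketch of what "different dummy variable coding schemes" can look like, the code below fits the same one-factor model under indicator coding and under effect coding. It uses PlantGrowth, a data set that ships with base R, purely as a stand-in (it is not one of the exam data sets), and the object names are my own. This is one way to do it, not necessarily how the exam printouts were produced.

```r
# PlantGrowth ships with base R: plant weight by group (ctrl, trt1, trt2).
# It is only a stand-in here, not one of the exam data sets.
pg <- PlantGrowth

# Scheme 1: indicator (treatment) coding, with ctrl as the reference category.
fit_indicator <- lm(weight ~ group, data = pg,
                    contrasts = list(group = "contr.treatment"))

# Scheme 2: effect (sum-to-zero) coding.
fit_effect <- lm(weight ~ group, data = pg,
                 contrasts = list(group = "contr.sum"))

# The regression coefficients mean different things under each scheme ...
print(coef(fit_indicator))
print(coef(fit_effect))

# ... but the overall F test for group does not depend on the coding.
f_indicator <- anova(fit_indicator)["group", "F value"]
f_effect    <- anova(fit_effect)["group", "F value"]
print(c(indicator = f_indicator, effect = f_effect))
```

Under indicator coding the intercept is the mean of the reference category; under effect coding (with equal sample sizes, as here) it is the grand mean. Being able to read the coefficients under either scheme is the point of the exercise.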

If you do nothing else, at least familiarize yourself with the studies and variables. The final exam assumes you are familiar with these data sets. It does not include a full description of the studies and variables. During the exam, I will answer questions about the data, but only if the answers are very brief.

You will notice below that unlike the R assignments during the term, you are not always being asked specific sample questions about the data sets. This time, it is your job to ask the relevant questions and choose the statistical techniques that will help you answer them. The questions on computer assignments during the term should be your guide. For some of the data sets, more than one statistical technique is applicable, and you should not hesitate to do more than one kind of analysis. Be prepared to follow up any significant tests with multiple comparisons.
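For practice with follow-up tests, here is a minimal sketch of two standard multiple comparison procedures available in base R: Bonferroni-corrected pairwise t tests and Tukey's HSD. Which procedure is appropriate depends on what was covered in lecture; these two are my choice for illustration only. PlantGrowth is again just a stand-in data set.

```r
# Stand-in data from base R, not one of the exam data sets.
pg <- PlantGrowth

# Overall one-factor analysis of variance
fit <- aov(weight ~ group, data = pg)
print(summary(fit))

# Follow-up 1: all pairwise t tests with a Bonferroni correction
bonf <- pairwise.t.test(pg$weight, pg$group, p.adjust.method = "bonferroni")
print(bonf)

# Same tests with no correction, to see how much the correction costs
uncorrected <- pairwise.t.test(pg$weight, pg$group, p.adjust.method = "none")

# Follow-up 2: Tukey's HSD on the fitted aov object
tukey <- TukeyHSD(fit)
print(tukey)
```

After deciding which pairs differ, look at the group means so you can say which group is higher, not just that there is a difference.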

Of course you may discuss the questions with other people, but this is not the time to let yourself be convinced too easily by your friends. I promise you that in several cases, there is more than one set of questions you could ask about the data, and (correspondingly) more than one natural and reasonable analysis. Please avoid tunnel vision, and do it your way first. Then compare answers. This way, it's more likely that somebody will think of what I'll do on the exam. In a group setting, if four people come up with six analyses, the whole group will benefit. The question is not which analysis is right, but whether each one is reasonable (or not).

Thirty-four marks out of 100 are based on R output.

The Birth Weight Data

This is a built-in R data set. In the R Package manager, check MASS. Then click on birthwt to see a description of the data set. One possible response variable is baby's weight at birth, but also the "indicator of birth weight less than 2.5 kg" variable is clinically meaningful because babies in that category tend to have health problems.
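If you prefer the console to the package manager, the following sketch loads the data and fits one model for each of the two response variables mentioned above. The choice of explanatory variables here is arbitrary and only for illustration; part of your preparation is deciding which ones make sense.

```r
library(MASS)      # MASS is a recommended package distributed with R
# help(birthwt)    # uncomment to open the description of the data set

bw <- birthwt
# Treat race as categorical; labels follow the MASS help page.
bw$race <- factor(bw$race, levels = 1:3, labels = c("white", "black", "other"))

# Continuous response: birth weight in grams
fit_weight <- lm(bwt ~ age + lwt + race + smoke, data = bw)
print(summary(fit_weight))

# Binary response: indicator of birth weight less than 2.5 kg
fit_low <- glm(low ~ age + lwt + race + smoke, family = binomial, data = bw)
print(summary(fit_low))
```

Note that the signs of the coefficients will tend to be opposite in the two models, because a higher birth weight corresponds to a lower chance of being in the low-weight category.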

The Bunnies Data

I know this is gruesome, but the data are real -- from the U of T School of Dentistry.

An experiment in dentistry seeks to test the effectiveness of a drug (HEBP) that is supposed to help dental implants become more firmly attached to the jaw bone. This is an initial test on animals. False teeth were implanted into the leg bones of rabbits, and the rabbits were randomly assigned to receive either the drug or a saline solution (placebo). Technicians administering the drug were blind to experimental condition.

Rabbits were also randomly assigned to be "sacrificed" after either 3, 6, 9 or 12 days. At that time, the implants were pulled out of the bone by a machine that measures force in newtons and stiffness in newtons/mm. For both of these measurements, higher values indicate more healing, because it takes more force to pull out the tooth. A measure of "pre-load stiffness" in newtons/mm is also available for each animal. This may be another indicator of how firmly the false tooth was implanted into the bone, but it might even be a covariate. Nobody can seem to remember what "preload" means, so we'll ignore this variable for now. The variables are

  1. Identification code
  2. Time (3,6,9,12 days of healing)
  3. Drug (1=HEBP, 0=saline solution)
  4. Stiffness in newtons/mm
  5. Force in newtons
  6. Preload stiffness in newtons/mm

The main question in this study is whether the HEBP drug helps the dental implants become more firmly attached to the bone. I will use aggregate to look at the treatment means. Use header=T when reading the data.
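Here is a small sketch of reading a file with a header line and using aggregate to get treatment means. The rows below are made up for illustration and are NOT the rabbit data, and the column names are my assumption based on the variable list; check the header line of the actual file.

```r
# Made-up rows for illustration only; these are NOT the real rabbit data.
# Column names are assumed from the variable list above.
toy <- "
ID Time Drug Stiffness Force Preload
1  3    0    10        20    5
2  3    1    12        24    6
3  6    0    14        30    5
4  6    1    18        38    7
5  3    0    12        22    6
6  3    1    14        28    6
7  6    0    16        32    6
8  6    1    20        42    8
"

# header = TRUE is the spelled-out form of header=T
bunnies <- read.table(text = toy, header = TRUE)

# Mean of each response variable in every Time-by-Drug treatment combination
cell_means <- aggregate(cbind(Stiffness, Force) ~ Time + Drug,
                        data = bunnies, FUN = mean)
print(cell_means)
```

With the real data, replace the text = toy argument with the location of the data file. The table of cell means is what lets you say which condition had the firmer attachment, rather than just that the conditions differ.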

The Salmon Data

These data represent growth for a sample of Alaskan and Canadian salmon. Apparently, growth during different time periods can be estimated by the diameter of rings in a fish's scales. We have two measurements of growth: marine growth (growth during the fishes' first year of life in the ocean) and freshwater growth. The variables are:

I think of this as a three-factor design. Because there is only one difference variable, you can do the whole analysis with univariate methods; there is no need for manova. I will use aggregate to look at the treatment means, but only to understand effects that are statistically significant. Use header=T when reading the data.

The Blood Flow Data

A clinician studied the effects of 2 drugs used either alone or together on the blood flow of human subjects. Twelve healthy middle-aged men participated in the study; they are viewed as a random sample. Each of the men received all four treatment combinations in a random order, with 2-week resting periods in between. The four values for each subject are increases in blood flow compared to a single baseline measurement. Here is the format of the data file:

                            No Drug A          Yes Drug A
                          _____________      ______________
                          No B     Yes B      No B     Yes B
    Patient               ----     -----      ----     -----
        1                  2         10         9        25  
        2                 -1          8         6        21  
        3                  0         11         8        24  
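Because every man received all four treatment combinations, one reasonable approach (not necessarily the only one) is to reduce each subject's four numbers to a single contrast for each effect and then test whether the mean of that contrast is zero. The sketch below does this using only the three patients shown above, typed in by hand; the column names are my own invention. With the full file of twelve patients the same code applies.

```r
# The three patients shown in the layout above, typed in by hand.
# Column names are hypothetical.
flow <- data.frame(
  patient   = 1:3,
  noA_noB   = c(2, -1, 0),
  noA_yesB  = c(10, 8, 11),
  yesA_noB  = c(9, 6, 8),
  yesA_yesB = c(25, 21, 24)
)

# One number per subject for each effect
# Main effect of A: average with A minus average without A
effA <- with(flow, (yesA_noB + yesA_yesB) / 2 - (noA_noB + noA_yesB) / 2)
# Main effect of B: average with B minus average without B
effB <- with(flow, (noA_yesB + yesA_yesB) / 2 - (noA_noB + yesA_noB) / 2)
# Interaction: effect of B when A is given, minus effect of B when A is not given
AxB  <- with(flow, (yesA_yesB - yesA_noB) - (noA_yesB - noA_noB))

# One-sample t tests of whether each mean contrast differs from zero
print(t.test(effA))
print(t.test(effB))
print(t.test(AxB))
```

Three subjects are far too few to take these particular p-values seriously; the point is the mechanics. A positive mean for the interaction contrast would indicate that Drug B raises blood flow more when Drug A is also given.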

More comments and suggestions

An anonymous (I hope) graduate student marked the quizzes, but I mark the final examination. We have basically the same standards and objectives, but we are not identical (lucky for him). You might say that this section is about my personal peculiarities -- just in the way I mark exams, of course. It is helpful for you to know about this, so your exam-taking strategy will not conflict with my exam-marking strategy.

The purpose of this course is to help you learn to use statistical methods to draw reasonable conclusions from numerical data. Often, the first several parts of a question will ask for technical details, and the last part will ask for a conclusion (often in plain, non-statistical language). If the technical part is missing, it does not matter what you conclude. Similarly, an answer that has most of the technical details right but gets the conclusion wrong (or leaves it off, or states it incompletely) is almost worthless, and will get few marks. On the other hand, if you make minor technical mistakes but draw reasonable conclusions from what you have, you can still get substantial marks.

When I read an answer, my main goal is to verify that you know what's going on. Here are some more details, mostly about what to avoid.

  1. Make sure you answer the question that is asked.
    1. If you answer another question instead of the one that's asked, you will lose substantial marks. It is especially risky to just dump memory and answer a similar question from one of the assignments. If I detect this, you will get a zero for the question. Thinking is what's important. Memory without thinking is a crime that you should try to hide if you do commit it.
    2. If you answer the question and also write something correct that is not asked, you will not get any extra marks. Your marks will be based on your answer to what is asked.
    3. However, if you say something off-topic that is wrong, you can definitely lose marks. To repeat, if you write a perfect answer to the question that is asked, and also write something incorrect, you will lose marks.
  2. Vocabulary is important. A large part of this course is about communication. You must be able to deal with the subject matter using both technical terms and plain language.
  3. Some questions on the final may ask you to state results "in plain, non-statistical language." Please do not ignore the request for plain language. Regardless of what you say, if plain language is requested then you will get zero marks if you mention the null hypothesis, or use any statistical or technical terms like correlation, regression, ANOVA, statistically significant, factorial design, positive relationship, controlling for, and so on. Even the word "significant" (without "statistically") should be avoided; it's borderline.
  4. It is also very important in describing a set of findings to say what happened! For example, do not just say that the average amount of rot in potatoes was related to temperature. Instead, say that there was more rot on average at warmer temperatures.

    In a real-world situation (and in the artificial world we presently inhabit, too), you don't get part marks for an answer that (correctly) indicates a relationship is present, but does not say what it is. Imagine you are working in marketing, and you leave a voice mail that says "Consumers recalled one of the commercials better than the other one." Click. Are you trying to frustrate your boss? Are you trying to get fired?

  5. Some professors mark by looking for the correct answer, or part of it. If they find something good, you get points for it. This can encourage a kind of shotgun strategy for writing answers. Just write everything you can think of, and maybe some of it will be what this peculiar individual is looking for.

    But that strategy backfires when I mark an exam, because (except for simple numerical answers) I usually do not give marks for things that are correct; I take off marks for things that are wrong or missing. So, if a student writes a long answer that includes the correct conclusion, the wrong conclusion (based on the same information!) and something irrelevant, all I really see is the contradiction between the two conclusions, and I will probably give the answer a zero. Yet it might be that the student understands everything perfectly, but is just writing all the crazy stuff as insurance against the unlikely possibility that maybe that's what I am looking for. Let's make sure that you don't fall into this trap!