STA442/1008 Final Exam Information
Time and Location
The final exam will be on Thursday April 12th from 9 a.m. to 12 p.m. in Gym C (Davis Building).
Jerry's Office Hours for the Final
- Tuesday April 3d, 10:30-12:30
- Thursday April 5th 10:30-12:30
- Monday April 9th 11:00-2:00
- Wednesday April 11th 11:00-2:00
Quiz solutions are now posted below.
Format
You will write your answers on the question paper. The exam will be closed book and closed notes. You should bring a calculator (any kind is acceptable unless it has communications capability). Pencil is okay.
There are 10 questions, occupying 9 pages. Most of the questions have more than one part. The questions are not equally difficult, and not equally time-consuming. The questions on assignments and quizzes are a good indication of what to expect.
The last 4 questions (worth 52 out of 100 marks on the exam) are based on SAS output that will be provided to you with the exam. Again, the type of questions will be familiar from the assignments and quizzes. More information about the SAS part is given below.
Coverage
The final exam is cumulative. What you are supposed to be able to do is indicated by the assignments, including the SAS part of this one. The text and lecture overheads are intended to help you understand how to answer questions like the ones in the assignments.
Not all parts of the course are equally represented. Here are some details.
- The basic concepts and vocabulary of Chapter One are important,
and worth a relatively large number of marks. See Assignment 1. There
will be some true-false questions. The true-false section is marked
all-or-nothing: you must get at least 10 out of 13 right to get full
marks, and otherwise you get zero. Some of the true-false questions will
be about material from later in the course, not just Chapter 1. Do you
think you might have to make up a study?
- Based on a data-oriented question (like "Given level of rainfall,
is soil acidity a useful predictor of crop yield?"), be able to state the
null hypothesis you'd test. The answer will be some statement about
μ or (more likely) β quantities. This kind of question is
quick, clean, and requires you to connect the research questions of a
study to the statistical model. There is more than one question of this
type. Together, they are worth a lot of marks.
- Regression is important. Review the ideas
of full versus reduced models, interpretation of regression
coefficients, and dummy variable coding for categorical independent
variables. How many kinds of dummy variable coding do you know? They are all on the exam. You may be asked to calculate Y-hat. Formulas (including the formula for F in terms of a and a in terms of F) will be on the cover sheet of the exam if you
need them; you need not memorize any formulas. Bring a
calculator. Do you think you might have to fill in a table of
E[Y|X] for the different values of a categorical independent
variable?
- Of course you will be asked to set up tests for main effects and interactions in terms of contrasts or regression coefficients.
- Analysis of within-cases data is important. Of the three main
methods for treating such data, the old-fashioned mixed model approach (in which
subject is a random effect nested within the between-cases factors) is in
the text only (no lecture) and will not be on the exam. As you know, these
tests appear as a by-product of the multivariate approach (labelled
"univariate tests"), but you will not be asked any questions about them. But for any data where a within-cases analysis is possible, you should be prepared for either the multivariate approach or the covariance structure approach. You will not see both for the same data set, but you should do it both ways to prepare for the exam.
- Bonferroni follow-ups will be emphasized over Scheffé. You
don't need to know any formulas, except maybe that the adjusted
p-value = p*k.
- Early versions of the exam were much too long. The exam is very predictable, but some things you
would expect to see do not appear because they were cut out.
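As a small illustration of the Bonferroni rule mentioned in the list above (adjusted p-value = p*k), here is a sketch of a SAS data step. The data set name, the test labels and the p-values are all made up for illustration. The cap at 1 is the usual convention (a p-value cannot exceed 1) and is not stated above.

```sas
/* Hypothetical example: k = 3 follow-up tests with made-up p-values */
data bonf;
   input test $ p;
   k = 3;                   /* number of follow-up tests being corrected for */
   padj = min(1, p*k);      /* Bonferroni adjusted p-value, capped at 1 */
   datalines;
A_vs_B 0.012
A_vs_C 0.030
B_vs_C 0.450
;
run;

proc print data=bonf;
run;
```

With these made-up numbers, 0.012*3 = 0.036 is still below 0.05, while 0.030*3 = 0.09 is not. Comparing the adjusted p-value to 0.05 is the same as comparing the unadjusted p-value to 0.05/k.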
You will not be asked to write any SAS code on the final. There will be five data sets, two of which you have seen before. They are described below. To prepare for the final exam, familiarize yourself with the data sets and analyze them using methods from the course. My SAS variable names are given, and I suggest you use them, even for the furnace data. Draw conclusions, and be ready to state your findings in plain, non-statistical language. You will not bring your printouts to the exam. Instead, you will get a copy of my printouts, and will answer questions based on them. The idea is that even if you do not do things exactly the same way I do, you will understand the output a lot better and faster if you have done it yourself. Note that I may or may not center the data for some analyses.
If you do nothing else, at least familiarize yourself with the studies and variables. The final exam does not include a full description of the studies and variables. During the exam, Cristina and I will answer questions about the data, but only if the answers are very brief.
You will notice below that unlike the SAS assignments during the term,
you are not always being asked specific sample questions about the data
sets. This time, it is your job to ask the relevant questions and
choose the statistical techniques that will help you answer them. The
questions on computer assignments during the term should be your guide. For
some of the data sets, more than one statistical technique is applicable,
and you should not hesitate to do more than one kind of analysis. Be prepared to follow up any significant multivariate tests with Bonferroni-corrected univariate tests. See, if you understand that last sentence, you've learned something in the course.
Of course you may discuss the questions with other people, but this is not the time to let yourself be convinced too easily by your friends. I promise you that in several cases, there is more than one set of questions you could ask about the data, and (correspondingly) more than one natural and reasonable analysis. Please avoid tunnel vision, and do it your way first. Then compare answers. This way, it's more likely that somebody will think of what I'll do on the exam. In a group setting, if four people come up with six analyses, the whole group will benefit. The question is not which analysis is right, but whether each one is reasonable (or not).
We have some unfinished business here. We never had a quiz on Assignment 6, so there are some potential exam questions. Also, think of a 2-factor ANOVA: Type of vent damper by in-out. There are 2 main ways I might do it. And elementary tests are always possible. My variable names are: typfurn area shape height liner house age dampin dampout damper.
For 97 countries, the United Nations supplied data on birth rates, death rates, infant death rates, life expectancies for males and females, and Gross National Product. The variables (with my variable names) are:
- birthrate: Live birth rate per 1,000 of population
- deathrate: Death rate per 1,000 of population
- infmort: Infant deaths per 1,000 of population under 1 year old
- lifexM: Life expectancy at birth for males
- lifexF: Life expectancy at birth for females
- gnp: Gross National Product per capita in U.S. dollars
- group: Country Group
  - Eastern Europe
  - South America and Mexico
  - Western Europe, North America, Japan, Australia, New Zealand -- let's just call them "Industrialized."
  - Middle East
  - Asia
  - Africa
- country: Country -- not a variable, more like a case identifier.
To me, the birth and health stuff are the dependent variables.
You saw this data set in
Assignment 10. The data are already set up nicely for proc mixed. My variable names are id Sex Order Predator Distance Calls. If you recall, obtaining the marginal means (including 2-way tables of marginal means, averaging over the 3d variable) was a chore in Assignment 10. But proc mixed gives them to you easily with lsmeans. My strategy will be to list only the effects that are statistically significant. For example, if the only significant effects were the main effect for B, the A by C interaction and the A by B by C interaction, I would say
lsmeans B A*C A*B*C;
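In context, that statement would sit inside a proc mixed step. The following is only a sketch. The data set name mydata, the response y, the case identifier id and the factors A, B and C are placeholders, and the unstructured covariance matrix is chosen just for illustration. The analysis on the exam may well be set up differently.

```sas
proc mixed data=mydata;             /* mydata, y, id, A, B, C are placeholders */
   class id A B C;
   model y = A|B|C;                 /* main effects and all interactions */
   /* Assumed covariance structure: unstructured. Without a repeated     */
   /* effect, this assumes the observations appear in the same order     */
   /* within each subject.                                               */
   repeated / type=un subject=id;
   lsmeans B A*C A*B*C;             /* marginal means for the significant effects only */
run;
```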
These data
represent growth for a sample of Alaskan and Canadian salmon.
Apparently, growth during different time periods can be estimated
by the diameter of rings in a fish's scales. We have two measurements
of growth: marine growth (growth during the fishes' first year of life
in the ocean) and freshwater growth. The variables (with my variable
names) are:
- country: 1=Alaskan 2=Canadian
- gender: 1=Female 2=Male
- fresh: Diameter of rings for first-year
freshwater growth in 100ths of an inch
- marine: Diameter of rings for first-year marine
growth in 100ths of an inch
Either approach to repeated measures is a possibility.
A clinician studied the effects of 2 drugs used either alone or together on the blood flow of human subjects. Twelve healthy middle-aged men participated in the study; they are viewed as a random sample. Each of the men received all four treatment combinations in a random order, with 2-week resting periods in between. The four values for each subject are increases in blood flow compared to a single baseline measurement. My variable names are Patient NN NY YN YY.
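Because every subject receives all four treatment combinations, these are within-cases data, so either the multivariate approach or the covariance structure approach could be tried. Here is one possible sketch of each, not necessarily the way it will be done on the exam. It assumes a data set named bloodflow, and it assumes that the two letters in NN, NY, YN and YY stand for the first drug (No/Yes) followed by the second drug (No/Yes). The names drug1, drug2, increase and longflow are made up for illustration.

```sas
/* Multivariate approach: one line of data per patient.            */
/* The first factor in the repeated statement changes slowest,     */
/* so drug2 changes fastest across NN NY YN YY.                    */
proc glm data=bloodflow;
   model NN NY YN YY = / nouni;
   repeated drug1 2, drug2 2;
run;

/* Covariance structure approach: one line of data per measurement */
data longflow;
   set bloodflow;
   drug1 = 'N'; drug2 = 'N'; increase = NN; output;
   drug1 = 'N'; drug2 = 'Y'; increase = NY; output;
   drug1 = 'Y'; drug2 = 'N'; increase = YN; output;
   drug1 = 'Y'; drug2 = 'Y'; increase = YY; output;
   keep Patient drug1 drug2 increase;
run;

proc mixed data=longflow;
   class Patient drug1 drug2;
   model increase = drug1|drug2;          /* two main effects and the interaction */
   repeated / type=un subject=Patient;    /* assumed structure: unstructured */
run;
```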
For the past exams, ignore anything with categorical dependent variables, except for basic chi-square tests of independence. Especially, ignore the questions on logistic regression in the 2009 exam.
- STA442 Spring 2008
- STA442 Fall 2009
Here are some of the quizzes (all except number six), with solutions. From this point on, no further changes
to the marking of the quizzes will be considered.
Cristina marks the quizzes, and I mark the final examination. We have
basically the same standards and objectives, but we are not identical
(lucky for her). You might say that this section is about my personal
peculiarities -- just in the way I mark exams, of course. It is helpful for
you to know about this, so your exam-taking strategy will not conflict with
my exam-marking strategy.
The purpose of STA442 is for you to learn to use statistical methods to
draw reasonable conclusions from numerical data. Often, the first several
parts of a question will ask for technical details, and the last part will
ask for a conclusion (often in plain, non-statistical language). If the technical part is missing, it does not matter
what you conclude. Similarly, an answer that has most of the technical
details right but gets the conclusion wrong (or leaves it off, or states it
incompletely) is almost worthless, and will get few marks. On the other
hand, if you make minor technical mistakes but draw reasonable conclusions from
what you have, you can still get substantial marks.
When I read an answer, my main goal is to verify that you know what's
going on. Here are some more details, mostly about what to avoid.
- Make sure you answer the question that is asked.
  - If you answer another question instead of the one that's asked, you will lose substantial marks. It is especially risky to just
  dump memory and answer a similar question from one of the assignments. If I detect this, you
  will get a zero for the question. Thinking is what's important. Memory
  without thinking is a crime that you should try to hide if you do commit
  it.
  - If you answer the question and also write something correct that is not asked, you will not get any extra marks. Your marks will be based on your answer to what is asked.
  - However, if you say something off-topic that is wrong, you can definitely lose marks. To repeat, if you write a perfect answer to the question that is asked, and also write something incorrect, you will lose marks.
- Vocabulary is important. A large part of this course is about communication. You must be able to deal with the subject matter using both technical terms and plain language.
- Some questions on the final may ask you to state results "in plain, non-statistical language." Please do not ignore the request for plain language. If plain language is requested, then regardless of what else you say, you will get zero marks if you mention the null hypothesis, or use any statistical or technical terms like correlation, regression, ANOVA, statistically significant, factorial design, positive relationship, controlling for, and so on. Even the word "significant" (without "statistically") should be avoided; it's borderline.
- It is also very important in describing a set of findings to
say what happened! For example, do not just say that
the average amount of rot in potatoes was related to
temperature. Instead, say that there was more rot on average at
warmer temperatures.
In a real-world situation (and in the artificial world we presently
inhabit, too), you don't get part marks for an answer that (correctly)
indicates a relationship is present, but does not say what it is. Imagine
you are working in marketing, and you leave a voice mail that says
"Consumers recalled one of the commercials better than the other one."
Click. Are you trying to frustrate your boss? Are you trying
to get fired?
- Some professors mark by looking for the correct answer, or
part of it. If they find something good, you get points for
it. This can encourage a kind of shotgun strategy for writing
answers. Just write everything you can think of, and maybe some of
it will be what this peculiar individual is looking for.
But that strategy backfires when I mark an exam, because
(except for simple numerical answers) I usually do not give marks for
things that are correct; I take off marks for things that are wrong or
missing. So, if a student writes a long answer that includes the correct
conclusion, the wrong conclusion (based on the same information!) and
something irrelevant, all I really see is the contradiction between the two
conclusions, and I will probably give the answer a zero. Yet it might be
that the student understands everything perfectly, but is just writing all
the crazy stuff as insurance against the unlikely possibility that maybe
that's what I am looking for. Let's make sure that you don't fall
into this trap!