STA 301F97 Assignment 13 and hints for Final Exam on Dec. 15 (Room 1143, 2-5 pm)

In your preparation for this exam, you may be interested in a rough text version of the 1995 final, and the printout that goes with it.

The final exam will be comprehensive, covering all the material in the course. It will be a 3 hour exam. You will be allowed to use a calculator; otherwise it is closed book. The format will be similar to that of the term tests, with a printout based on three data sets:

1. The heart data, first used in Assignment 8. In that assignment I gave you my variable names; please use them. The dependent variables you should consider are chd and alive. In case you don't have the data any more, you can get a copy with

cp /student/jbrunner/public/heart .

2. The peru data. The DV of interest is blood pressure -- systolic and diastolic. Get a copy of the data with

cp /student/jbrunner/public/peru .

3. The furnace data. Here the dependent variable of interest is energy consumption -- both with damper active and damper inactive. Get a copy with

cp /student/jbrunner/public/furnace .

On the final printout, you may see analyses using the following SPSS programs on any of the three data sets, so try a variety of tests and decide what the results mean.

Moreover, you will not see output from TTEST, CORRELATIONS, or CROSSTABS. So don't bother to run analyses using these programs. But do think about how they might be applied, especially to the furnace data. I will present you with a table of questions based on the furnace data, and you will tell me the independent variable(s), the dependent variable(s), and the most appropriate significance test.

You will be asked to choose the MOST ELEMENTARY test that is appropriate. In other words, if there is one quantitative IV and one quantitative DV, the answer is simple univariate regression, not univariate multiple regression, simple multivariate regression, or multivariate multiple regression. If there is one categorical IV and one binary DV, the answer is the chi-square test of independence, not logistic regression, even though you could set up dummy variables and get the right answer with a logistic regression. If you don't understand the idea here, ask before the exam.
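
In case it helps with that last point, here is a rough sketch (in Python rather than SPSS, with completely made-up counts) showing that the chi-square test of independence and a logistic regression with one dummy variable are asking the same question about a categorical IV and a binary DV. The two chi-square values are not identical -- one is a Pearson chi-square, the other a likelihood ratio chi-square -- but they test the same null hypothesis.

import numpy as np
from scipy.stats import chi2_contingency
import statsmodels.api as sm

# Made-up 2x2 table: rows are the two categories of the IV,
# columns are the two values of the binary DV.
table = np.array([[30, 70],
                  [50, 50]])
chi2stat, p, df, expected = chi2_contingency(table, correction=False)
print("Pearson chi-square:", chi2stat, "p =", p)

# The same question as a logistic regression with a single dummy IV.
y = np.repeat([1, 0, 1, 0], [30, 70, 50, 50])      # binary DV, case by case
group = np.repeat([0, 0, 1, 1], [30, 70, 50, 50])  # dummy variable for the IV
fit = sm.Logit(y, sm.add_constant(group)).fit(disp=0)
print("Likelihood ratio chi-square:", fit.llr, "p =", fit.llr_pvalue)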

As I mentioned in class, the Wald tests are less reliable than the likelihood ratio chi-square tests corresponding to the F for R-squared change. In my preparation of the exam, I have seen serious inconsistency between the two approaches when they are testing the same hypothesis. It is so bad that on the final exam you should disregard the Wald tests entirely! The correct chi-square and p-value will NEVER be from a Wald test. When you are doing your logistic regression analyses, you might use the Wald tests as a guide to which full & reduced models to fit, but don't trust them.
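
For what it's worth, the likelihood ratio test is just the difference in -2 log likelihoods between a full and a reduced model, with degrees of freedom equal to the number of terms dropped. Here is a rough sketch in Python; the variable names age, chol and chd are invented for illustration and have nothing to do with the actual exam data.

import numpy as np
from scipy.stats import chi2
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
age = rng.normal(50, 10, n)                    # hypothetical quantitative IVs
chol = rng.normal(220, 40, n)
chd = rng.binomial(1, 1 / (1 + np.exp(-(-6 + 0.1 * age))))  # hypothetical binary DV

X_full = sm.add_constant(np.column_stack([age, chol]))
X_reduced = sm.add_constant(age)
full = sm.Logit(chd, X_full).fit(disp=0)
reduced = sm.Logit(chd, X_reduced).fit(disp=0)

# Likelihood ratio chi-square for the term(s) dropped from the full model;
# this is the logistic-regression analogue of the F test for R-squared change.
lr = 2 * (full.llf - reduced.llf)
df = X_full.shape[1] - X_reduced.shape[1]
print("LR chi-square =", lr, "df =", df, "p =", chi2.sf(lr, df))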

You may be asked to design an original study that would use, say, a three-way multivariate analysis of covariance, or a logistic regression with one quantitative IV and dummy variables for a categorical IV with 3 categories. If two people have the same "original" study, neither will get any marks for the question.

If the idea of experimental vs. observational studies and evidence for causality does not come up, it will be an oversight on my part.

There will be a set of true-false questions. You will have to get almost all of them right in order to get any marks. They will be based on the following information.


The independent variable (or variables) is used to predict the
dependent variable (or variables).

If p < 0.05, it means that data like the ones we have observed are
very unlikely if the independent  variable and dependent variable are
actually unrelated.  We say the results are STATISTICALLY
SIGNIFICANT, and we are free to discuss the results and try to
explain them.  If p > 0.05, we say the results are not statistically
significant and that there is no evidence of a relationship between
the independent variable and the dependent variable.  DON'T GET THIS
BACKWARDS!

R-squared (not r!) is the proportion of variation in the dependent
variable that is explained by the independent variable or variables.
Never confuse this with a p-value.
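
For example, in simple regression R-squared is just the square of the
correlation coefficient: with r = .52 (as in the sample question at
the end of this handout), R-squared = (.52)^2, or about .27, so the
independent variable explains roughly 27% of the variation in the
dependent variable.  Neither .52 nor .27 is a p-value.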

A non-significant correlation coefficient is NOT evidence of a
curvilinear relationship between two variables.

A positive correlation (like a positive regression coefficient in
simple regression) implies a positive relationship between X and Y --
that is, high values of X go with high values of Y and low values of
X go with low values of Y.

A negative correlation (like a negative regression coefficient in
simple regression) implies a negative relationship between X and Y --
that is, high values of X go with low values of Y and low values of X
go with high values of Y. A negative correlation does NOT imply that
X and Y are independent (unrelated).

In multiple regression, a positive regression coefficient implies
that when all other variables in the equation are held constant,
there is a positive relationship between the independent variable
(the one corresponding to the regression coefficient) and the
dependent variable.  A negative coefficient implies a negative
relationship, again when all other variables are held constant.
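
If you want to see what "held constant" does, here is a rough sketch
with made-up data (Python rather than SPSS): when two IVs are
correlated, the simple-regression slope for one of them can look
quite different from its coefficient in the multiple regression,
because the multiple-regression coefficient describes the
relationship with the other IV held constant.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.5, size=n)   # x1 and x2 are correlated
y = 2 * x2 + rng.normal(size=n)           # y really depends only on x2

simple = sm.OLS(y, sm.add_constant(x1)).fit()
multiple = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("Slope for x1, x1 alone:        ", simple.params[1])    # clearly positive
print("Slope for x1, x2 held constant:", multiple.params[1])  # near zero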

If a nominal scale variable (that is, a variable whose values are
labels for unordered categories) has more than two categories, it
just does not make sense to compute the mean.  It does not make sense
to use such a variable as a dependent variable in ordinary
regression.

A matched t-test between a binary (two values) variable and a
continuous variable is a crime against nature and should receive the
death penalty.

An experimental study is one in which cases are randomly assigned to
the different values of an independent variable (or variables).  An
observational study is one in which the values of all independent
variables are not randomly assigned, but merely observed.

A CONFOUNDING VARIABLE is a variable not included as an independent
variable, that might be related to both the independent variable and
the dependent variable -- and that might therefore create a seeming
relationship between them where none actually exists, or might even
hide a relationship that is present.  Some books also call this a
"lurking variable."  You are responsible for the vocabulary
"confounding variable."

Because of possible confounding variables, only an experimental study
can provide good evidence that an independent variable CAUSES a
dependent variable.  Words like effect, affect, leads to etc. imply
claims of causality and are only justified for experimental studies.

Suppose we have independent variables A and B, and a dependent
variable C.  An INTERACTION between A and B means that the
relationship of A to C depends on the value of B -- or equivalently,
that the relationship of B to C depends on the value of A.  It does
not, repeat NOT mean that A and B are related.

Suppose we have a nominal scale independent variable (A) and a
continuous independent variable (B) in a multiple regression.  In
this case an interaction between A and B means that the regression
lines (one for each value of A) are not parallel.
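
As a rough sketch of how that is set up (Python rather than SPSS,
made-up data, and A limited to two categories so that one dummy
variable is enough): the interaction is carried by a
dummy-by-continuous product term, and the test for that term asks
whether the two regression lines have the same slope.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
b = rng.normal(size=n)                      # continuous IV (B)
a = rng.integers(0, 2, size=n)              # dummy variable for a two-category A
y = 1 + 2 * b + 1.5 * a * b + rng.normal(size=n)   # the slope differs by group

X = sm.add_constant(np.column_stack([b, a, a * b]))  # B, dummy, product term
fit = sm.OLS(y, X).fit()
# The test for the product term (the last coefficient) is the test for the
# interaction, i.e. for whether the regression lines are parallel.
print("Interaction coefficient:", fit.params[3], "p =", fit.pvalues[3])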

I will also feel free to (maybe) include absolutely standard
examples, and ask if they are reasonable.  Like for example:  "In
order to describe a possible linear relationship between high school
grade point average (GPA) and university GPA, it makes sense to
compute a correlation coefficient."  The answer is true.  Yeah, we
should look at a scatterplot too, but that does not make the
correlation coefficient meaningless.  And the rule is, NO TRICK
QUESTIONS on the true-false.

Notice that the true-false questions may sometimes be based on
EXAMPLES that refer to the information listed above.  Another sample
question: "In a study investigating the relationship of university
grade point average and income ten years after graduation, we observe
r = .52, p < 0.0001. There is no evidence of a relationship between
grades and income." False.