SAS Assignment for Final Exam
Please do this assignment in preparation for the final exam. As usual,
bring your log and list files. It is possible that you will be asked to attach
part of your output as a component of the exam. You will definitely be asked
to answer questions based on your output, in some cases using a calculator.
The assignment is based on the SENIC dataset in Appendix C. See text for a
description of the variables. The data are available on the data disk that
came with your text, and also here.
For all analyses, the dependent variable is infection risk. Make a command
file that reads the data, produces frequency distributions of the qualitative
variables, and produces simple descriptive statistics including means and
standard deviations for the quantitative variables. Make dummy variables for
region. In most if not all analyses, "region" will refer to the collection of
dummy variables. To be safe, always produce sequential sums of squares. If you
are using proc reg, this is done with the / ss1 option. In proc glm, it's
always produced by default.
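A minimal sketch of such a command file follows. The file name hospital.dat, the data set name hosp, all variable names, the column order in the input statement, and the choice of region 4 as the reference category are assumptions made for illustration. Check them against the variable descriptions in the text before trusting any output.

```sas
/* Sketch only: file name, variable names and column order are assumptions */
title 'Final exam assignment: descriptive statistics';

data hosp;
     infile 'hospital.dat';            /* hypothetical name of the raw data file */
     input id stay age infrisk culture xray beds medschl
           region census nurses service;
     /* Dummy variables for region, leaving region 4 as the reference category.
        If region is missing, r1-r3 stay missing rather than becoming 0. */
     if region ne . then do;
        r1 = (region = 1);
        r2 = (region = 2);
        r3 = (region = 3);
     end;
run;

proc freq data=hosp;                   /* qualitative variables */
     tables medschl region;
run;

proc means data=hosp n mean std;       /* quantitative variables */
     var stay age infrisk culture xray beds census nurses service;
run;

proc reg data=hosp;                    /* ss1 requests sequential sums of squares */
     model infrisk = nurses / ss1;
run;
quit;
```

The two check answers in the first question below are a quick way to confirm that the input statement matches the layout of the data file.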
- Just to make sure you're reading the data right, here are some very simple questions with answers.
- What is the mean number of nurses? Answ: 173.2477876
- What percentage of hospitals in the sample have a medical school affiliation? Answ: 15.04
- Fit a model with just one independent variable - number of nurses. This
is simple regression. At alpha=0.05, is there evidence of a linear
relationship between number of nurses and infection risk? Is the relationship
statistically significant? What proportion of the variation in infection risk
is explained by number of nurses? Do you reject the null hypothesis? What is
the null hypothesis? Express the null hypothesis in matrix form. In
non-technical language, what do you conclude? Zero points for statistical
jargon. As an example of a good answer, fill in the blank. "Hospitals with
more nurses tend to have _____ infection risk." In your opinion, is this
strange? Answer yes or no. Is this an experimental study, or observational? In
your opinion, for this model, is there a causal relationship between the
independent variable and the dependent variable?
- Fit a model with just one independent variable - Medical School
Affiliation. Again, this is simple regression. What does b0 mean in
your output? If you were to re-express medical school affiliation as a 0-1
variable, what would the intercept mean then? Estimate the difference in
expected infection risk between hospitals with and without a medical school
affiliation. What proportion of the variation in infection risk is explained
by medical school affiliation?
At alpha=0.05, is there evidence of a difference in infection risk between
hospitals with and without a medical school affiliation? Is the difference
statistically significant? Do you reject the null hypothesis? What is the null
hypothesis? Express the null hypothesis in matrix form. In non-technical
language, what do you conclude? Zero points for statistical jargon. As an
example of a good answer, fill in the blank. "Hospitals _____ a medical school
affiliation tend to have lower infection risk." In your opinion, is this
strange? Answer yes or no. Is this an experimental study, or observational? In
your opinion, for this model, is there a causal relationship between the
independent variable and the dependent variable?
- Now fit a model with two independent variables -- number of nurses and
average daily census. What has happened to the relationship between number of
nurses and infection risk? Once you control for average daily census, what
proportion of the remaining variation in infection risk is explained by
number of nurses?
- Fit a model with all the independent variables except 1, 5 and 6, of
course representing region by its dummy variables. This model has everything
but those strange xray and culturing ratio variables. We'll call it the
basic model.
- Controlling for all other variables in the model, how is
number of nurses related to infection risk?
- Test geographic region controlling for all other variables in
the model.
- Size of the hospital is roughly represented by three
variables: number of beds, average daily census, and number of nurses. Test
these three variables simultaneously controlling for all other variables in
the model.
- By my count, five independent variables fail to reach
statistical significance at alpha=0.05 by a t-test (excluding the dummy
variables for region). Test these variables simultaneously.
- Now do the same thing except include the dummy variables for
region in the test. Now you are testing 8 independent variables
simultaneously.
- This last test suggests a model with just two independent variables:
length of stay and number of nurses. Fit this model specifying the two
independent variables in that order, and at alpha=0.05, answer these
questions.
- When you control for length of stay, does number of nurses
help predict infection risk? In non-technical language, what do you conclude?
The answer is a statement about number of nurses and infection risk.
- Once you have allowed for length of stay, what proportion of
the remaining variation in infection risk is explained by number of
nurses?
- Give a three-variable model (a subset of the basic model) that does
NOT have any dummy variables for region, and that explains at least 35 percent
of the variation in infection risk. Two of the variables are number of nurses
and length of stay. Is the third one significant when you control for the
other two?
- The answer to that last question was no, so we are back to a
2-variable model. Fit a model with average length of stay, number of nurses,
and those 2 "ratio" variables -- numbers 5 and 6. What proportion of the
remaining variation do they explain? Is it significant at alpha=0.05? What is
the value of the test statistic? I get F = 19.25.
- Now we have a 4-variable model that explains over 50 percent of the
variation; not bad. But what if we had pooled the two "ratio" variables with
the others from the beginning (excluding region)? Can we come up with a better
4-variable model this way? Try it. Comment on the result from the standpoint
of the nursing profession.
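Several of the questions above ask for a simultaneous test of a set of variables controlling for the rest, or for the proportion of remaining variation explained. One possible way to get these from proc reg is sketched here, using test statements and the pcorr1 option. The data set name hosp, the variable names, and the dummy variables r1, r2 and r3 are hypothetical. This is an illustration of the syntax, not a prescribed solution.

```sas
/* Sketch only: data set and variable names are hypothetical */
proc reg data=hosp;
     /* Basic model: everything except the two "ratio" variables */
     model infrisk = stay age beds medschl census nurses service
                     r1 r2 r3 / ss1;
     /* Each test statement gives an F-test of the listed coefficients,
        controlling for all other variables in the model */
     region: test r1 = 0, r2 = 0, r3 = 0;
     size:   test beds = 0, census = 0, nurses = 0;
run;
quit;

proc reg data=hosp;
     /* Order matters for sequential quantities: stay first, then nurses */
     model infrisk = stay nurses / ss1 pcorr1;
run;
quit;
```

With pcorr1, the squared partial correlation reported for the last variable is its sequential sum of squares divided by the error sum of squares of the model without it, which is the proportion of remaining variation it explains. For a set of several variables, the same proportion can be computed by hand as (SSE reduced - SSE full) / SSE reduced, taking the two error sums of squares from the reduced and full model fits.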
As an example of regression diagnostics (which is what you would do now),
consider Studentized deleted residuals from the 4-variable model with length
of stay, available facilities and services, and the two "ratio"
variables. Perform a set of t-tests with a Bonferroni correction to see if any
of the residuals are too big. Get your critical value using proc iml. Which
observations, if any, are designated as outlying?
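The critical value calculation might be sketched as follows. Under the usual assumptions each Studentized deleted residual has a t distribution with n - p - 1 degrees of freedom, where p counts the regression coefficients including the intercept, and the Bonferroni correction splits alpha over n two-sided tests. The data set name, the variable names, and the value n = 113 are assumptions here. Use the number of observations shown in your own output.

```sas
/* Sketch only: data set and variable names are hypothetical */
proc reg data=hosp;
     model infrisk = stay service culture xray;
     output out=resids rstudent=delres;   /* Studentized deleted residuals */
run;
quit;

proc iml;
     n = 113;        /* assumed sample size: replace with the n in your output */
     p = 5;          /* four slopes plus the intercept */
     alpha = 0.05;
     /* Two-sided Bonferroni critical value for n simultaneous t-tests */
     critval = tinv(1 - alpha/(2*n), n - p - 1);
     print critval;
quit;

proc print data=resids;                   /* compare |delres| to critval */
     var id delres;
run;
```

An observation would be designated as outlying when the absolute value of its Studentized deleted residual exceeds the printed critical value.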