SAS Assignment for Final Exam



Please do this assignment in preparation for the final exam. As usual, bring your log and list files. It is possible that you will be asked to attach part of your output as a component of the exam. You will definitely be asked to answer questions based on your output, in some cases using a calculator.

The assignment is based on the SENIC dataset in Appendix C. See text for a description of the variables. The data are available on the data disk that came with your text, and also here.

For all analyses, the dependent variable is infection risk. Make a command file that reads the data, produces frequency distributions of the qualitative variables, and produces simple descriptive statistics, including means and standard deviations, for the quantitative variables. Make dummy variables for region. In most if not all analyses, "region" will refer to the collection of dummy variables. To be safe, always produce sequential sums of squares. If you are using proc reg, this is done with the / ss1 option. In proc glm, they are produced by default.
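As a starting point, a command file along these lines would cover the steps just described. This is only a sketch: the file name, the data set name, the variable names, and the assumed column order (taken from the variable numbering used in this assignment) are all assumptions, so check them against the text and the data file before relying on it. Region 4 is arbitrarily used as the reference category for the dummy variables.

```sas
/* Sketch only. The file name, variable names and column order are
   assumptions; verify them against the text and the data file. */
title 'Final exam assignment: descriptive statistics';

data senic;
     infile 'senic.dat';          /* hypothetical file name */
     input id stay age infrisk culratio xratio nbeds medschl
           region census nurses service;
     /* Dummy variables for region, with region 4 as the reference
        category. A missing region gives missing dummies, not zeros. */
     if region = . then do;
          r1 = .; r2 = .; r3 = .;
     end;
     else do;
          r1 = (region = 1);
          r2 = (region = 2);
          r3 = (region = 3);
     end;
run;

/* Frequency distributions of the qualitative variables */
proc freq data=senic;
     tables medschl region;
run;

/* Means and standard deviations of the quantitative variables */
proc means data=senic n mean std min max;
     var stay age infrisk culratio xratio nbeds census nurses service;
run;

/* The ss1 option asks proc reg for sequential sums of squares.
   Here "region" enters the model as its three dummy variables. */
proc reg data=senic;
     model infrisk = r1 r2 r3 / ss1;
run;
quit;
```

It is worth checking the proc freq output against the dummy variables (for example with tables region*r1) before fitting any models.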

  1. Just to make sure you're reading the data right, here are some very simple questions with answers.
  2. Fit a model with just one independent variable - number of nurses. This is simple regression. At alpha=0.05, is there evidence of a linear relationship between number of nurses and infection risk? Is the relationship statistically significant? What proportion of the variation in infection risk is explained by number of nurses? Do you reject the null hypothesis? What is the null hypothesis? Express the null hypothesis in matrix form. In non-technical language, what do you conclude? Zero points for statistical jargon. As an example of a good answer, fill in the blank. "Hospitals with more nurses tend to have _____ infection risk." In your opinion, is this strange? Answer yes or no. Is this an experimental study, or observational? In your opinion, for this model, is there a causal relationship between the independent variable and the dependent variable?
  3. Fit a model with just one independent variable - Medical School Affiliation. Again, this is simple regression. What does b0 mean in your output? If you were to re-express medical school affiliation as a 0-1 variable, what would the intercept mean then? Estimate the difference in expected infection risk between hospitals with and without a medical school affiliation. What proportion of the variation in infection risk is explained by medical school affiliation?

    At alpha=0.05, is there evidence of a difference in infection risk between hospitals with and without a medical school affiliation? Is the difference statistically significant? Do you reject the null hypothesis? What is the null hypothesis? Express the null hypothesis in matrix form. In non-technical language, what do you conclude? Zero points for statistical jargon. As an example of a good answer, fill in the blank. "Hospitals _____ a medical school affiliation tend to have lower infection risk." In your opinion, is this strange? Answer yes or no. Is this an experimental study, or observational? In your opinion, for this model, is there a causal relationship between the independent variable and the dependent variable?

  4. Now fit a model with two independent variables -- number of nurses and average daily census. What has happened to the relationship between number of nurses and infection risk? Once you control for average daily census, what proportion of the remaining variation in infection risk is explained by number of nurses?
  5. Fit a model with all the independent variables except 1, 5 and 6, of course representing region by its dummy variables. This model has everything but those strange xray and culturing ratio variables. We'll call it the basic model.
    1. Controlling for all other variables in the model, how is number of nurses related to infection risk?
    2. Test geographic region controlling for all other variables in the model.
    3. Size of the hospital is roughly represented by three variables: number of beds, average daily census, and number of nurses. Test these three variables simultaneously controlling for all other variables in the model.
    4. By my count, five independent variables fail to reach statistical significance at alpha=0.05 by a t-test (excluding the dummy variables for region). Test these variables simultaneously.
    5. Now do the same thing except include the dummy variables for region in the test. Now you are testing 8 independent variables simultaneously.
  6. This last test suggests a model with just two independent variables: length of stay and number of nurses. Fit this model specifying the two independent variables in that order, and at alpha=0.05, answer these questions.
    1. When you control for length of stay, does number of nurses help predict infection risk? In non-technical language, what do you conclude? The answer is a statement about number of nurses and infection risk.
    2. Once you have allowed for length of stay, what proportion of the remaining variation in infection risk is explained by number of nurses?
  7. Give a three-variable model (a subset of the basic model) that does NOT have any dummy variables for region, and that explains at least 35 percent of the variation in infection risk. Two of the variables are number of nurses and length of stay. Is the third one significant when you control for the other two?
  8. The answer to that last question was no, so we are back to a 2-variable model. Fit a model with average length of stay, number of nurses, and those 2 "ratio" variables -- numbers 5 and 6. What proportion of the remaining variation do they explain? Is it significant at alpha=0.05? What is the value of the test statistic? I get F = 19.25.
  9. Now we have a 4-variable model that explains over 50 percent of the variation; not bad. But what if we had pooled the two "ratio" variables with the others from the beginning (excluding region)? Can we come up with a better 4-variable model this way? Try it. Comment on the result from the standpoint of the nursing profession. As an example of regression diagnostics (which is what you would do now), consider Studentized deleted residuals from the 4-variable model with length of stay, available facilities and services, and the two "ratio" variables. Perform a set of t-tests with a Bonferroni correction to see if any of the residuals are too big. Get your critical value using proc iml. Which observations, if any, are designated as outlying?
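
For the Bonferroni step in the last question, one possible layout is sketched below. It assumes a SAS data set named senic with variables named id, infrisk, stay, service, culratio and xratio; all of these names are hypothetical. It also assumes n = 113 hospitals, which you should replace with the number of observations shown in your own output. The critical value is the upper alpha/(2n) point of a t distribution with n-p-1 degrees of freedom, where p counts the regression coefficients including the intercept.

```sas
/* Sketch only: data set and variable names are assumptions. */
proc reg data=senic;
     model infrisk = stay service culratio xratio;
     /* rstudent= saves the Studentized deleted residuals */
     output out=resdata rstudent=delresid;
run;
quit;

proc iml;
     n = 113;        /* assumed sample size; use the n from your output */
     p = 5;          /* four slopes plus the intercept */
     alpha = 0.05;
     /* Two-sided Bonferroni critical value for n simultaneous t-tests */
     critval = tinv(1 - alpha/(2*n), n - p - 1);
     print critval;
quit;

/* List the largest residuals in absolute value so they can be
   compared with the critical value printed by proc iml. */
data resdata2;
     set resdata;
     absres = abs(delresid);
run;

proc sort data=resdata2;
     by descending absres;
run;

proc print data=resdata2 (obs=10);
     var id delresid absres;
run;
```

Any observation whose absolute Studentized deleted residual exceeds critval would be designated as outlying under this rule.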