STA429/1007 Assignment 8

Quiz on Thursday Nov. 15th at 10:10 a.m.


This assigment will guide you through a logistic regression analysis of the Heart data from Assignment 7. The dependent variable will be whether the participant was dead 10 years after the beginning of the study. Please use the original coding, in which 0=alive and 1=dead. You may call this variable anything you like, but I will refer to it as "Died" or "Death."

First, create a new Body Mass Index (BMI) variable. The Wikipedia has a formula that applies to weight in pounds and height in inches. Now, please follow these steps.

  1. Do a crosstab of Coronary Heart disease by Died, including chisquare tests. What is the estimated probability of death given no coronary heart disease? The answer is a number.
  2. Now you'll see why this causes problems. Consider a logistic regression with just one independent variable: the binary variable indicating presence versus absence of Coronary Heart Disease. It goes like this:

    Log Odds of Death = β0 + β1x, so Odds of Death = eβ0 + β1x.

    With x=0 (No Coronary Heart Disease), we have

    Odds of Death = eβ0.

    But if a probability equals zero, then the odds = zero too. This means

    eβ0=0.

    Unfortunately, there is no number β0 that satisfies this equation, and finding an estimated β0 that satisfies it will not be possible either. This is why the logistic regression blows up if CHD is included as an independent variable. It's a great predictor -- too good!

  3. Proceeding bravely with your eyes closed, use proc logistic to carry out a logistic regression in which the dependent variable is died, and there is just one independent variable: the indicator for Coronary Heart Disease. Please start with proc logistic order=internal descending; just to be safe. On your list file, circle all the indications that something is terribly wrong. You can take it from me that the same thing happens if you include CHD in a logistic regression along with a bunch of other variables.
  4. We are not stopped. Here's how I would like you to proceed. If a patient has no CHD, we predict he will be alive 10 years after entering the study. Now we'll try to predict death for those patients who do have CHD. The next part of my program goes like this (the name of my original SAS data set is heart). You should do something similar.
    data sick;
         set heart;
         if chd=1;   /* Just data for patients with CHD */
    
    Subsequent proc steps will use the most recently created SAS data set, which is sick.

    Now carry out logistic regression on the 104 people with Coronary Heart Disease. The dependent variable is died, and the independent variables are age, education, BMI, diastolic blood pressure, cholesterol level, number of cigarettes, and family history of heart disease. (Do you have anybody with 99 years of education? What do you think you should do about it?)

    1. There is a likelihood ratio test for all the independent variables simultaneously. Give the value of the test statistic (I get 15.3799), the degrees of freedom, and the p-value. Is it significant? What do you conclude?
    2. Only one of the independent variables is significant controlling for all the others. Which one is it? Describe the finding in plain language.
    3. If age increases by one year with all the other independent variables held constant, the odds of death are multiplied by _____. The answer is a number on the printout.

Please bring your log file and your list file to the quiz.