STA441 Assignment 6
Quiz on Thursday, February 25th, in tutorial. (Bring a calculator.)
The following formulas may be useful, but you do not need to memorize them. If they are necessary, they will be provided with the quiz.
Note: For the purposes of this assignment, all test statistics are Wald chi-squares.
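For reference, the standard logistic regression relationships can be sketched as follows. The notation is an assumption and may differ from the formula sheet supplied with the quiz: π stands for the probability that the response variable equals 1, and x_1, ..., x_{p-1} are the explanatory variables.

```latex
% Log odds is linear in the explanatory variables
\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}

% Odds
\frac{\pi}{1-\pi} = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}}

% Probability
\pi = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}}}
           {1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}}}
```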
- If P(A) = 2/3, the odds of A equal _____.
- If P(A) = 1/4, the odds of A equal _____.
- If the odds of A equal x, the probability of A equals _____.
- Consider a logistic regression in which the cases are newly married couples with both people from the same religion, the explanatory variable is religion (A, B, C and None -- let's call "None" a religion), and the response variable is whether the marriage lasted 5 years (1=Yes, 0=No).
- Make a table with four rows, showing how you would set up indicator dummy variables for Religion, with None as the reference category.
- Add a column showing the odds of the marriage lasting 5 years. The symbols for your dummy variables should not appear in your answer, because they are zeros and ones, and different for each row.
- What is the ratio of the odds of lasting 5 years or more for Religion C to the odds of lasting 5 years or more for No Religion? Answer in terms of the β symbols of your model.
- What is the ratio of the odds of lasting 5 years or more for Religion A to the odds of lasting 5 years or more for Religion B? Answer in terms of the β symbols of your model.
- You want to test whether Religion is related to whether the marriage lasts 5 years. State the null hypothesis in terms of one or more β values.
- You want to know whether marriages from Religion A are more likely to last 5 years than marriages from Religion C. State the null hypothesis in terms of one or more β values.
- You want to test whether marriages between people of No Religion have a 50-50 chance of lasting 5 years. State the null hypothesis in terms of one or more β values.
- Data are collected on university students who were looking for employment immediately following graduation. Three pieces of information are available for each student:
- Academic Division (Humanities, Sciences or Social Sciences)
- Final cumulative Grade Point Average
- Whether they were employed full time 6 months after graduation: Yes or No.
Consider a logistic regression model with no intercept, cell mean coding, and GPA centered by subtracting off the mean for the entire sample.
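One way the model just described might be written is sketched below. The symbols are assumptions, not part of the assignment: d_1, d_2 and d_3 are hypothetical names for the Humanities, Sciences and Social Sciences indicators, x is GPA, and the ordering of the β's is arbitrary.

```latex
% No intercept, one indicator per division (cell means coding), centered GPA
\log\left(\frac{\pi}{1-\pi}\right)
  = \beta_1 d_1 + \beta_2 d_2 + \beta_3 d_3 + \beta_4 (x - \bar{x})
```

With this coding, a student with average GPA has x - \bar{x} = 0, so the log odds reduce to the single β for that student's division.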
- Make a table with three rows, showing how you would set up indicator dummy variables for Academic Division.
- Add a column showing the odds of being employed. The symbols for your dummy variables should not appear in your answer, because they are zeros and ones, and different for each row.
- What is the ratio of the odds of being employed for a Humanities graduate to the odds of being employed for a Sciences graduate? Answer in terms of the β symbols of your model. Does this odds ratio depend on GPA? Answer Yes or No.
- What is the ratio of the odds of being employed for a Sciences graduate to the odds of being employed for a Social Sciences graduate? Answer in terms of the β symbols of your model. Does this odds ratio depend on GPA? Answer Yes or No.
- One grade point on a four point scale is pretty large. When GPA increases by one point, the odds of being employed are multiplied by _____. Answer in terms of the β symbols of your model. Does this odds ratio depend on Academic Division? Answer Yes or No.
- Controlling for GPA, you want to test whether students from the different academic divisions have different chances of finding a job. State the null hypothesis in terms of one or more β values.
- You want to know whether, controlling for Academic Division, the chances of finding a job depend on your marks. State the null hypothesis in terms of one or more β values.
- What is the probability of employment for a Sciences graduate with average GPA? Answer in terms of the β symbols of your model.
- Some people say that allowing for marks, a Humanities graduate has no better than a 50% chance of finding work within 6 months of graduation. What null hypothesis would you use to test this claim? State the null hypothesis in terms of one or more β values.
- The file heart.txt contains data from a long-term study of middle-aged male employees of the Western Electric Company in the 1950s. The first part of the file gives descriptions of the variables. This part should be stripped off or skipped using the firstobs option on the infile statement.
Please write a SAS program that reads and labels the data, including a proc format. This data file contains numeric missing value codes: 99, 999 and so on. You should convert them to the SAS missing value code using if statements (not a text editor!). In addition to the variables in the file, please create an additional quantitative variable: Body Mass Index (BMI). Wikipedia has a definition at http://en.wikipedia.org/wiki/Body_mass_index.
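The overall shape of such a program might look like the sketch below. Everything specific in it is hypothetical and must be checked against the descriptions at the top of heart.txt: the variable names and their order, the firstobs value, which missing value code goes with which variable, the 0/1 coding of the categorical variables, and the units of weight and height.

```sas
/* Sketch only: names, order, firstobs and codes below are hypothetical. */
proc format;
    value yesnofmt 0 = 'No' 1 = 'Yes';     /* assumes 0/1 coding */
run;

data heart;
    /* firstobs= gives the line number of the first data line;
       30 is a placeholder, not the real value */
    infile 'heart.txt' firstobs=30;
    /* List input assumes values separated by blanks */
    input age educ weight height chol famhist chd;
    /* Convert numeric missing value codes to the SAS missing value */
    if educ   = 99  then educ   = .;
    if weight = 999 then weight = .;
    if height = 99  then height = .;
    if chol   = 999 then chol   = .;
    /* BMI is weight in kg divided by squared height in metres.
       With pounds and inches (assumed here), multiply by 703.
       A missing weight or height yields a missing BMI. */
    bmi = 703 * weight / height**2;
    label educ = 'Years of education'
          bmi  = 'Body Mass Index'
          chd  = 'Coronary heart disease';
    format famhist chd yesnofmt.;
run;
```

Running proc means on the result and comparing with the values quoted in the next item is a quick check that the missing value codes were all caught.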
- Obtain means and standard deviations of all the quantitative variables. I got mean years of education equal to 11.6603774 and a minimum BMI of 18.8928114.
- Obtain frequency distributions of the categorical variables. It seems that 13 people died on Friday.
- Look at a table of first coronary heart disease event by whether or not the person has coronary heart disease. Does it look okay? If so, relax. If not, track down any problems and fix them using common sense.
The objective here is to find variables that predict presence of coronary heart disease (CHD). One could call this homework assignment "Risk factors in Coronary Heart Disease," and it would sound good.
- Please consider a very simple model with just one explanatory variable: family history of CHD.
- Are the explanatory and response variables related? Answer Yes, No or No Conclusion.
- Give the value of the test statistic; the answer is a number from your printout.
- Give the p-value; the answer is a number from your printout.
- The odds of coronary heart disease are estimated to be ____ times as great for those with a family history of CHD. The answer is a number from your printout.
- Using numbers from your printout and proc iml, estimate the probability of CHD for study participants with a family history of CHD. Also estimate the probability for those without a family history. Be able to do these calculations with a calculator, too. How could you check your answers with proc freq?
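A minimal proc iml sketch of the probability calculation follows. The values of b0 and b1 are placeholders, not numbers from any printout, and the sketch assumes family history is coded 1 = Yes, 0 = No with an intercept in the model.

```sas
proc iml;
    /* Placeholder estimates: replace with the numbers on your printout */
    b0 = -1.0;                   /* intercept */
    b1 =  0.5;                   /* coefficient of the family history indicator */
    odds1 = exp(b0 + b1);        /* estimated odds of CHD, family history = 1 */
    odds0 = exp(b0);             /* estimated odds of CHD, family history = 0 */
    p1 = odds1 / (1 + odds1);    /* probability = odds / (1 + odds) */
    p0 = odds0 / (1 + odds0);
    print p1 p0;
quit;
```

The same two steps, exponentiate the linear combination and then divide the odds by one plus the odds, are what a calculator version would do.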
- Now add age to the model.
- Controlling for age, is family history of CHD significantly related to CHD? Answer Yes or No and give the value of the test statistic and the p-value (numbers from the printout).
- Controlling for family history of CHD, is age significantly related to CHD? Answer Yes or No and give the value of the test statistic and the p-value (numbers from the printout).
- Give the value of the test statistic and the p-value for the simultaneous test of age and family history of CHD.
- Controlling for family history of CHD, for each year of increase in age, the estimated odds of coronary heart disease are multiplied by ____. The answer is a number from your printout. Please disregard the significance test this time.
Looking back at your proc means, note the age range. Does this help explain why age was not significant?
- Now fit a larger model in which the explanatory variables are
- Family history of CHD
- Age
- Reported number of cigarettes per day
- Blood pressure
- Cholesterol level
- BMI
- Education
Look at the default output, and then carry out a simultaneous test of the set of explanatory variables that are not significantly related to CHD, controlling for all the others. Give the value of the test statistic and the p-value. Does it look okay to drop all these variables from the model?
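A simultaneous Wald test of several coefficients can be requested with the test statement of proc logistic, roughly as sketched below. The data set name, the variable names and the two variables placed in the test are all hypothetical; the actual set to test depends on what the default output shows.

```sas
/* Sketch only: data set and variable names are hypothetical */
proc logistic data=heart;
    model chd (event='1') = famhist age cigs bp chol bmi educ;
    /* Wald chi-square test that the listed coefficients are all zero,
       controlling for the other variables in the model.
       age and bp are arbitrary stand-ins for the non-significant set. */
    dropthese: test age = 0, bp = 0;
run;
```

The label before the colon (dropthese here) is only a name that appears beside the test in the output.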
- It is amazing, but we seem to have only two useful explanatory variables. Fit the model with just those two.
- Fill in the blank. Allowing for education, the more you smoke, the ____ likely you are to have CHD.
- Fill in the blank. Allowing for smoking, the more educated you are, the ____ likely you are to have CHD.
- When we control for reported number of cigarettes per day and increase reported years of education by one year, the estimated odds of coronary heart disease are multiplied by ____. The answer is a single number from your printout. Does this make sense?
- Use proc iml to estimate the probability of coronary heart disease for a man with 16 years of education who smokes 25 cigarettes per day. Be able to do this calculation with a calculator, too.
- Use proc iml to estimate the probability of coronary heart disease for a man with 12 years of education who smokes zero cigarettes per day. Be able to do this calculation with a calculator, too.
- Summarize the results of this study in plain, non-statistical language.
- Why do the results for age illustrate the danger of formally accepting the null hypothesis with too much certainty?
Please bring your log file and your results file to the quiz. Bring a calculator.