STA442/1008 Assignment 7

Quiz on Friday Nov. 11th in tutorial


This assignment is based upon lecture material on stepwise regression and replication of a prediction model.

For this assignment, please start by doing a stepwise regression on the exploratory tv data. By stepwise, I mean the method that combines forward and backward selection. The variable you are trying to predict is Price willing to pay for cable TV (q4). The potential predictor variables are Value of home, q1 to q3, q5 to q9, total number of people in the household, and two indicator (dummy) variables for Location (please make Urban the reference category). Do variable selection on the exploratory data in three ways, yielding three different regression models:

  1. Model 1: Using the default significance levels to enter or remove a variable from the model.
  2. Model 2: Set both significance levels to 0.05.
  3. Model 3: Just use the first three variables to enter the stepwise models. I suggest using proc reg, not proc stepwise for this one, though you do get the same answer.

For each of these models, you have a regression equation; the coefficients are estimated from the exploratory data. Now we will see how good they are, using a second independent sample from the same population.

The replication data are available in the file tv2.dat. The variables are the same as in tv1.dat.

If you are on tuzo, there is another way to get a copy of the file. At the Unix prompt, type

cp /student/etsta313/www/442sas/tv2.dat .

The period is important. It refers to your current directory.

For each of the three models, compute a new variable using the replication data -- predicted Y. That is, you are using the regression equations from the exploratory data to generate three predicted values of price willing to pay, for every case in the replication data file.

Now generate 3 more variables; each one is the absolute value of the difference between observed and predicted Y. Small values represent accurate prediction.
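
The two steps above (predicted Y, then absolute error) can be sketched as follows. This is only an illustration in Python rather than SAS: the coefficients, the predictor names x1 and x2, and the three data cases are all made up, and the real coefficients must be copied from your own regression output on the exploratory data.

```python
# Hypothetical intercept and slopes standing in for one fitted model.
# The real values come from the regression on the exploratory data.
intercept = 2.0
slopes = {"x1": 0.5, "x2": -1.0}

# Made-up replication cases: observed y plus the predictor values.
replication = [
    {"y": 4.0, "x1": 2.0, "x2": 1.0},
    {"y": 1.5, "x1": 1.0, "x2": 0.5},
    {"y": 6.0, "x1": 6.0, "x2": 0.0},
]

def predict(case):
    """Apply the exploratory-sample equation to one replication case."""
    return intercept + sum(b * case[name] for name, b in slopes.items())

# Predicted Y for every replication case; no coefficients are re-estimated.
predicted = [predict(c) for c in replication]

# Absolute difference between observed and predicted Y.
abs_error = [abs(c["y"] - p) for c, p in zip(replication, predicted)]
```

The key point is that the replication data are never used to estimate anything; they are only plugged into the equation obtained from the exploratory data.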

You have done a lot of computation now, and possibly you are unsure whether you have it right. As a check, for the absolute value of the difference between observed and predicted q4 from Model 2, I got a sample standard deviation of 1.7045522. This answer has been verified by Christine. If you are a grad student using Version 6 on credit, your answer will be very close, but different in perhaps the last 2 decimal places. This is because Version 6 prints the regression coefficients to more decimal places than Version 8 does.

In the replication sample, what proportion of the variation in Price Willing to Pay is explained by each of the three predicted Y variables? Your answer is three numbers. Which model appears to be best?
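
One way to read "proportion of variation explained" here is as the squared sample correlation between observed and predicted Y, which equals the R-squared from regressing observed Y on that single predicted variable. A minimal Python sketch under that assumption (the function name r_squared and the numbers are illustrative only):

```python
def r_squared(observed, predicted):
    """Squared sample correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Sums of cross-products and squares about the means.
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    return s_op ** 2 / (s_oo * s_pp)

# Made-up observed and predicted values for three replication cases.
example = r_squared([4.0, 1.5, 6.0], [2.0, 2.0, 5.0])
```

Computing this once for each of the three predicted Y variables gives the three numbers asked for.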

We'd like a firmer conclusion. Here's a straightforward way to get one. Your absolute value variables represent inaccuracy of prediction. The one with the largest population mean corresponds to the worst model. So, do all pairwise matched t-tests. Protecting all three tests with a Bonferroni correction, what do you conclude? Which prediction model would you recommend?
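
As a rough sketch of the arithmetic behind one matched t-test and the Bonferroni correction, again in Python with made-up numbers: the function name paired_t and the overall level 0.05 are assumptions, and in practice the p-values come from your SAS output.

```python
import math
import statistics

def paired_t(a, b):
    """Matched (paired) t statistic and degrees of freedom for a minus b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_error, n - 1

# Made-up absolute errors from two models on the same four cases.
abs_err_a = [2.0, 0.5, 1.0, 1.5]
abs_err_b = [1.0, 0.5, 0.5, 1.0]
t_stat, df = paired_t(abs_err_a, abs_err_b)

# Bonferroni: with 3 pairwise tests and an assumed joint level of 0.05,
# each individual p-value is compared with 0.05 / 3.
n_tests = 3
per_test_alpha = 0.05 / n_tests
```

A test counts as significant only if its p-value is below per_test_alpha; the sign of the t statistic then shows which model of the pair has the larger mean absolute error.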

Bring your log and list files to the quiz. Your list file contains many significance tests. You should be able to interpret all of them (well, except maybe for some of the t-tests for the intercepts).

Look carefully at what happened in the last 2 steps of the run yielding Model 1. Why did those last 2 variables make it into Model 1 but not into Model 2? Why is it a little bit surprising?