Revised STA442/1008 Assignment 7

Quiz on Friday Oct. 30th in tutorial


Important Note: This assignment was revised on Monday Oct. 26th! Please delete the variable "total number of people in household." It no longer appears in the list of predictors below.

This assignment is based upon lecture material on stepwise regression and replication of a prediction model. The dependent variable is Price willing to pay for cable TV. You will treat this as quantitative throughout most of the assignment, but a reality check is in order. Start with a nice simple frequency distribution of the dependent variable, using the TV data from earlier assignments. What percent of respondents said they were unwilling to pay any money at all for cable TV? (This was 25 years ago.)
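If you want to check the arithmetic of the frequency distribution outside SAS, here is a minimal Python sketch. The list price is made-up data for illustration only; the real values come from the TV data file.

```python
from collections import Counter

# Hypothetical responses in dollars per month; NOT the real TV data.
price = [0, 5, 10, 0, 15, 5, 0, 20, 10, 5, 25, 0]

counts = Counter(price)
n = len(price)

# Frequency distribution: value, count, percent of respondents.
for value in sorted(counts):
    print(f"{value:>3}  {counts[value]:>3}  {100 * counts[value] / n:5.1f}%")

# Percent of respondents unwilling to pay anything at all.
pct_zero = 100 * counts[0] / n
print(f"Unwilling to pay anything: {pct_zero:.1f}%")
```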

Now start by doing stepwise regression on the TV data. This is the exploratory sample. By stepwise, I mean the method that combines forward and backward selection. The variable you are trying to predict is Price willing to pay for cable TV. The potential predictor variables are

Do variable selection on the exploratory data in three ways, yielding three different regression models; call them Model One, Model Two and Model Three.
  1. Using the default significance levels to enter or remove a variable from the model.
  2. Set both significance levels to 0.05.
  3. Just use the first three variables to enter the stepwise models. I suggest using proc reg rather than proc stepwise for this one, though you do get the same answer either way.

Look carefully at what happened at each stage of the stepwise selection. For example, in Model One, Step 9: What happened to the p-value of the variable that entered on Step 8? Why did this not happen in Model Two?

For each of the three models, you have a regression equation; the coefficients are estimated from the exploratory data. Now we will see how good they are, using a second independent sample from the same population. The replication data are available in the file tv2.data. The variables are the same as in tv1.dat. Of course for either data file, if you get NOTE: Invalid data, you must fix the problem. Check your log files!

For each of the three models, compute a new variable using the replication data -- predicted Y. That is, you are using the regression equations from the exploratory data to generate three predicted values of price willing to pay, for every case in the replication data file. Do not round the predicted values yet.
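The idea of applying an exploratory-sample equation to replication cases can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the coefficients, the predictor names x1 and x2, and the three cases are invented, and predict is a hypothetical helper. In your own work the coefficients come from your SAS output.

```python
# Hypothetical coefficients standing in for estimates from the exploratory sample.
# "x1" and "x2" are placeholder predictor names, not variables from the TV data.
model = {"intercept": 2.5, "x1": 0.8, "x2": -1.2}

# Hypothetical replication cases.
replication = [
    {"x1": 10.0, "x2": 1.0, "price": 10},
    {"x1": 2.0, "x2": 3.0, "price": 0},
    {"x1": 20.0, "x2": 0.0, "price": 20},
]

def predict(coefs, case):
    """Apply a fitted regression equation to one case, without rounding."""
    return coefs["intercept"] + sum(
        b * case[name] for name, b in coefs.items() if name != "intercept"
    )

# Attach an unrounded predicted Y to every replication case.
for case in replication:
    case["yhat"] = predict(model, case)
    print(case["price"], case["yhat"])
```

Note that nothing is re-estimated on the replication data; the equation is frozen before it ever sees the second sample.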

Now generate 3 more variables; each one is the absolute value of the difference between observed and predicted Y. Small values represent accurate prediction.
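As a rough illustration of the absolute-error variables, here is a Python sketch with invented observed and predicted values. It also computes the sample standard deviation (n - 1 divisor), which is the kind of summary you can compare against a check value.

```python
import statistics

# Hypothetical observed and predicted values for five replication cases.
observed = [10, 0, 20, 5, 15]
predicted = [9.3, 0.5, 18.5, 7.0, 11.0]

# Absolute value of (observed - predicted); small values mean accurate prediction.
abs_err = [abs(y - yhat) for y, yhat in zip(observed, predicted)]

mean_abs_err = statistics.mean(abs_err)
sd_abs_err = statistics.stdev(abs_err)  # sample standard deviation, n - 1 divisor

print(abs_err)
print(mean_abs_err, sd_abs_err)
```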

You have done a lot of computation now, and possibly you are unsure whether you have it right. As a check, for the absolute value of the difference between observed and predicted Price Willing to Pay from Model Two, I get a sample standard deviation of 1.7029626.

In the replication sample, what proportion of the variation in Price Willing to Pay is explained by each of the three predicted Y variables? Your answer is three numbers. Which model appears to be best?
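One way to read "proportion of variation explained" here is the squared sample correlation between observed Y and predicted Y, which equals the R-squared from a simple regression of observed on predicted. The Python sketch below assumes that reading; r_squared is a hypothetical helper and the data are invented.

```python
def r_squared(y, yhat):
    """Squared sample correlation between observed y and predicted yhat."""
    n = len(y)
    mean_y = sum(y) / n
    mean_p = sum(yhat) / n
    sxy = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, yhat))
    syy = sum((a - mean_y) ** 2 for a in y)
    spp = sum((b - mean_p) ** 2 for b in yhat)
    return sxy ** 2 / (syy * spp)

# Hypothetical observed and predicted values.
observed = [10, 0, 20, 5, 15]
predicted = [9.3, 0.5, 18.5, 7.0, 11.0]

print(r_squared(observed, predicted))
```

You would compute this once per model, giving three numbers to compare.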

We'd like a firmer conclusion. Here's a straightforward way to get one. Your absolute value variables represent inaccuracy of prediction. The one with the largest population mean corresponds to the worst model. So, do all pairwise matched t-tests. Protecting all three tests with a Bonferroni correction, what do you conclude? Which prediction model would you recommend at this point?
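The logic of the three Bonferroni-protected matched t-tests can be sketched as follows. This is an illustration under stated assumptions, not a substitute for your SAS output: the absolute errors are invented, paired_t and approx_two_sided_p are hypothetical helpers, and the p-value uses a normal approximation that is reasonable only for large samples (the exact test uses the t distribution with n - 1 degrees of freedom). The tiny data set is there only to keep the example short.

```python
import statistics
from itertools import combinations
from math import sqrt

# Hypothetical absolute errors for the same five cases under each model.
abs_err = {
    "Model One": [1.0, 2.0, 3.0, 4.0, 2.5],
    "Model Two": [0.5, 1.5, 2.0, 3.5, 1.0],
    "Model Three": [1.2, 1.8, 3.1, 3.9, 2.0],
}

def paired_t(a, b):
    """Matched t statistic and degrees of freedom for paired samples a and b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    t = statistics.mean(d) / (statistics.stdev(d) / sqrt(n))
    return t, n - 1

def approx_two_sided_p(t):
    """Two-sided p-value from a normal approximation (large samples only)."""
    return 2 * (1 - statistics.NormalDist().cdf(abs(t)))

alpha = 0.05
n_tests = 3
bonferroni_alpha = alpha / n_tests  # each test is judged at 0.05 / 3

for (name_a, a), (name_b, b) in combinations(abs_err.items(), 2):
    t, df = paired_t(a, b)
    p = approx_two_sided_p(t)
    verdict = "significant" if p < bonferroni_alpha else "not significant"
    print(f"{name_a} vs {name_b}: t = {t:.3f}, df = {df}, approx p = {p:.4f}, {verdict}")
```

A positive t for "A vs B" means A has the larger mean absolute error, so A is the less accurate model of the pair.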

Now it's time to use common sense to refine the prediction. Remember, the possible values of Price Willing to Pay are 0, 5, 10, 15, 20, 25. How many predicted Y values from your best model are negative? (Oops. My answer is 40.) So fix it up. Make one more variable, final predicted price willing to pay. It must only take the values 0, 5, 10, 15, 20, 25. Make a table of Price Willing to Pay by this variable, using the replication data, of course. Lots of questions are possible. For example, for those respondents in the replication sample with a predicted price willing to pay of $10 per month, what percent said they were willing to pay $15?
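Here is one possible way to build the final predicted value, sketched in Python. It assumes you snap each raw prediction to the nearest allowed price, so negatives become 0 and anything above 25 becomes 25; the assignment does not prescribe this rule, and exact ties go to the lower value here. The helper final_prediction and all the data are invented for illustration.

```python
from collections import Counter

ALLOWED = [0, 5, 10, 15, 20, 25]

def final_prediction(yhat):
    """Snap a raw prediction to the nearest allowed price (ties go to the lower value)."""
    return min(ALLOWED, key=lambda v: abs(v - yhat))

# Hypothetical raw predictions and observed prices for seven replication cases.
raw = [-3.2, 9.3, 11.0, 12.4, 8.0, 27.0, 4.9]
observed = [0, 10, 15, 10, 15, 25, 5]

n_negative = sum(1 for yhat in raw if yhat < 0)
final = [final_prediction(yhat) for yhat in raw]

# Two-way table: counts of (final predicted, observed) pairs.
table = Counter(zip(final, observed))

# Of the cases predicted at $10, what percent said $15?
row_total = sum(count for (pred, _), count in table.items() if pred == 10)
pct = 100 * table[(10, 15)] / row_total

print(n_negative, final)
print(f"Predicted $10, observed $15: {pct:.1f}% of {row_total} cases")
```

The percent is a row percent: the denominator is the number of cases with that predicted value, not the whole sample.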

Bring your log and list files to the quiz. Your list file contains many significance tests. You should be able to interpret all of them (well, except maybe for some of the t-tests for the intercepts).