October 6 model building with nuclear

require(SMPracticals); data(nuclear); head(nuclear)
## Loading required package: SMPracticals
## Loading required package: ellipse
## 
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
## 
##     pairs
##     cost  date t1 t2  cap pr ne ct bw cum.n pt
## 1 460.05 68.58 14 46  687  0  1  0  0    14  0
## 2 452.99 67.33 10 73 1065  0  0  1  0     1  0
## 3 443.22 67.33 10 85 1065  1  0  1  0     1  0
## 4 652.32 68.00 11 67 1065  0  1  1  0    12  0
## 5 642.23 68.00 11 78 1065  1  1  1  0    12  0
## 6 345.39 67.92 13 51  514  0  1  1  0     3  0
help(nuclear) 

You can also read about the data in Table 8.13 of Davison:

Table 8.13 of SM

As explained in class, and in SM, it makes sense to take logs of the continuous variables: cost, cap, T1, T2, cum.n (called \(N\) in SM).

The full model is:

nuclear.lm <- lm(log(cost) ~ date + log(t1) + log(t2) + log(cap) + pr + ne + ct + bw + log(cum.n) + pt, data = nuclear)
summary(nuclear.lm)
## 
## Call:
## lm(formula = log(cost) ~ date + log(t1) + log(t2) + log(cap) + 
##     pr + ne + ct + bw + log(cum.n) + pt, data = nuclear)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28574 -0.10408  0.02784  0.09512  0.25031 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.24198    4.22880  -3.368  0.00291 ** 
## date          0.20922    0.06526   3.206  0.00425 ** 
## log(t1)       0.09187    0.24396   0.377  0.71025    
## log(t2)       0.28553    0.27289   1.046  0.30731    
## log(cap)      0.69373    0.13605   5.099 4.75e-05 ***
## pr           -0.09237    0.07730  -1.195  0.24542    
## ne            0.25807    0.07693   3.355  0.00300 ** 
## ct            0.12040    0.06632   1.815  0.08376 .  
## bw            0.03303    0.10112   0.327  0.74715    
## log(cum.n)   -0.08020    0.04596  -1.745  0.09562 .  
## pt           -0.22429    0.12246  -1.832  0.08125 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1645 on 21 degrees of freedom
## Multiple R-squared:  0.8717, Adjusted R-squared:  0.8106 
## F-statistic: 14.27 on 10 and 21 DF,  p-value: 3.081e-07

I’ll let you fill in the details, but backwards elimination, taking out the least signficant variable until the remaning variables are significant, leads to

nuclear.lm2 <- lm(log(cost) ~ date + log(cap) + ne + ct + log(cum.n) + pt, data = nuclear)
summary(nuclear.lm2)
## 
## Call:
## lm(formula = log(cost) ~ date + log(cap) + ne + ct + log(cum.n) + 
##     pt, data = nuclear)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32721 -0.07620  0.02920  0.08115  0.28946 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.26031    3.13950  -4.224 0.000278 ***
## date          0.21241    0.04326   4.910 4.70e-05 ***
## log(cap)      0.72341    0.11882   6.088 2.31e-06 ***
## ne            0.24902    0.07414   3.359 0.002510 ** 
## ct            0.14039    0.06042   2.323 0.028582 *  
## log(cum.n)   -0.08758    0.04147  -2.112 0.044891 *  
## pt           -0.22610    0.11355  -1.991 0.057490 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1592 on 25 degrees of freedom
## Multiple R-squared:  0.8569, Adjusted R-squared:  0.8225 
## F-statistic: 24.95 on 6 and 25 DF,  p-value: 2.058e-09

You can compare this to SM Table 8.14: Table 8.14 SM

Cox and Snell note that adding back in the omitted variables (in a different order) doesn’t lead to an improved model. Using step will pick a model based on the smallest value of \(AIC\):

nuclear.lm2 <- step(nuclear.lm)
## Start:  AIC=-107
## log(cost) ~ date + log(t1) + log(t2) + log(cap) + pr + ne + ct + 
##     bw + log(cum.n) + pt
## 
##              Df Sum of Sq     RSS      AIC
## - bw          1   0.00289 0.57091 -108.840
## - log(t1)     1   0.00384 0.57186 -108.787
## - log(t2)     1   0.02961 0.59764 -107.376
## <none>                    0.56803 -107.002
## - pr          1   0.03862 0.60665 -106.897
## - log(cum.n)  1   0.08236 0.65038 -104.670
## - ct          1   0.08915 0.65718 -104.337
## - pt          1   0.09073 0.65876 -104.260
## - date        1   0.27800 0.84603  -96.254
## - ne          1   0.30440 0.87242  -95.271
## - log(cap)    1   0.70325 1.27128  -83.223
## 
## Step:  AIC=-108.84
## log(cost) ~ date + log(t1) + log(t2) + log(cap) + pr + ne + ct + 
##     log(cum.n) + pt
## 
##              Df Sum of Sq     RSS      AIC
## - log(t1)     1   0.00208 0.57300 -110.724
## <none>                    0.57091 -108.840
## - log(t2)     1   0.04532 0.61623 -108.396
## - pr          1   0.04610 0.61702 -108.355
## - log(cum.n)  1   0.07959 0.65050 -106.664
## - ct          1   0.08627 0.65718 -106.337
## - pt          1   0.08805 0.65897 -106.250
## - ne          1   0.30504 0.87595  -97.142
## - date        1   0.32231 0.89322  -96.517
## - log(cap)    1   0.70330 1.27421  -85.149
## 
## Step:  AIC=-110.72
## log(cost) ~ date + log(t2) + log(cap) + pr + ne + ct + log(cum.n) + 
##     pt
## 
##              Df Sum of Sq     RSS      AIC
## <none>                    0.57300 -110.724
## - log(t2)     1   0.04354 0.61654 -110.380
## - pr          1   0.04404 0.61703 -110.354
## - ct          1   0.08439 0.65739 -108.327
## - log(cum.n)  1   0.08554 0.65854 -108.271
## - pt          1   0.08842 0.66142 -108.132
## - ne          1   0.30677 0.87977  -99.003
## - date        1   0.66616 1.23915  -88.042
## - log(cap)    1   0.70765 1.28064  -86.988

This model includes log(T2) and pr as well, and we can compare the estimated coefficients and their standard errors:

summary(nuclear.lm2)
## 
## Call:
## lm(formula = log(cost) ~ date + log(t2) + log(cap) + pr + ne + 
##     ct + log(cum.n) + pt, data = nuclear)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29131 -0.09935  0.02178  0.09351  0.24800 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.22561    3.37328  -4.514 0.000156 ***
## date          0.22722    0.04394   5.171 3.06e-05 ***
## log(t2)       0.30186    0.22833   1.322 0.199155    
## log(cap)      0.68246    0.12805   5.330 2.07e-05 ***
## pr           -0.09336    0.07022  -1.330 0.196709    
## ne            0.25895    0.07379   3.509 0.001886 ** 
## ct            0.11462    0.06227   1.841 0.078631 .  
## log(cum.n)   -0.07873    0.04249  -1.853 0.076751 .  
## pt           -0.21572    0.11451  -1.884 0.072280 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1578 on 23 degrees of freedom
## Multiple R-squared:  0.8706, Adjusted R-squared:  0.8256 
## F-statistic: 19.34 on 8 and 23 DF,  p-value: 1.709e-08

This is typical, AIC tends to pick out slightly bigger models than stepwise regression.

Another model-building trick is to fit all possible subsets of the covariates; of course we can only do this with relatively few covariates. There is a pretty fast algorithm for this in the leaps package called regsubsets: see LM-1 p.149 and LM-2 p.154.

Finally, Cox & Snell suggest that it would be good to see if the 6 observations with pt\(=1\) have a different relation with cost, which can be assessed with interaction terms pt*\(z\), where \(z\) is one of the other variables retained. We can’t include all the interaction terms in a single model, but we can add them one at a time. For example:

nuclear.lm3 <- update(nuclear.lm2, . ~ . + pt*log(cap))
summary(nuclear.lm3)
## 
## Call:
## lm(formula = log(cost) ~ date + log(t2) + log(cap) + pr + ne + 
##     ct + log(cum.n) + pt + log(cap):pt, data = nuclear)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29357 -0.09831  0.01726  0.09470  0.25058 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.05672    3.49188  -4.312 0.000282 ***
## date          0.22540    0.04528   4.977 5.56e-05 ***
## log(t2)       0.29726    0.23357   1.273 0.216417    
## log(cap)      0.67870    0.13133   5.168 3.51e-05 ***
## pr           -0.09363    0.07167  -1.306 0.204931    
## ne            0.25832    0.07534   3.429 0.002401 ** 
## ct            0.11451    0.06356   1.802 0.085299 .  
## log(cum.n)   -0.07835    0.04338  -1.806 0.084597 .  
## pt           -1.90065    5.83497  -0.326 0.747703    
## log(cap):pt   0.25054    0.86745   0.289 0.775422    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1611 on 22 degrees of freedom
## Multiple R-squared:  0.8711, Adjusted R-squared:  0.8184 
## F-statistic: 16.52 on 9 and 22 DF,  p-value: 7.592e-08

There’s not much change in the other coefficients, so a final model could be nuclear.lm2 or even nuclear.lm3. Residual and diagnostic plots don’t show any surprises.

par(mfrow = c(2,2))
plot(nuclear.lm2)