nuclear
require(SMPracticals); data(nuclear); head(nuclear)
## Loading required package: SMPracticals
## Loading required package: ellipse
##
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
##
## pairs
## cost date t1 t2 cap pr ne ct bw cum.n pt
## 1 460.05 68.58 14 46 687 0 1 0 0 14 0
## 2 452.99 67.33 10 73 1065 0 0 1 0 1 0
## 3 443.22 67.33 10 85 1065 1 0 1 0 1 0
## 4 652.32 68.00 11 67 1065 0 1 1 0 12 0
## 5 642.23 68.00 11 78 1065 1 1 1 0 12 0
## 6 345.39 67.92 13 51 514 0 1 1 0 3 0
help(nuclear)
You can also read about the data in Table 8.13 of Davison (SM).
As explained in class, and in SM, it makes sense to take logs of the continuous variables: cost, cap, t1, t2, and cum.n (called \(N\) in SM).
The full model is:
nuclear.lm <- lm(log(cost) ~ date + log(t1) + log(t2) + log(cap) + pr + ne + ct + bw + log(cum.n) + pt, data = nuclear)
summary(nuclear.lm)
##
## Call:
## lm(formula = log(cost) ~ date + log(t1) + log(t2) + log(cap) +
## pr + ne + ct + bw + log(cum.n) + pt, data = nuclear)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28574 -0.10408 0.02784 0.09512 0.25031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.24198 4.22880 -3.368 0.00291 **
## date 0.20922 0.06526 3.206 0.00425 **
## log(t1) 0.09187 0.24396 0.377 0.71025
## log(t2) 0.28553 0.27289 1.046 0.30731
## log(cap) 0.69373 0.13605 5.099 4.75e-05 ***
## pr -0.09237 0.07730 -1.195 0.24542
## ne 0.25807 0.07693 3.355 0.00300 **
## ct 0.12040 0.06632 1.815 0.08376 .
## bw 0.03303 0.10112 0.327 0.74715
## log(cum.n) -0.08020 0.04596 -1.745 0.09562 .
## pt -0.22429 0.12246 -1.832 0.08125 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1645 on 21 degrees of freedom
## Multiple R-squared: 0.8717, Adjusted R-squared: 0.8106
## F-statistic: 14.27 on 10 and 21 DF, p-value: 3.081e-07
I’ll let you fill in the details, but backwards elimination, taking out the least significant variable until the remaining variables are significant, leads to
nuclear.lm2 <- lm(log(cost) ~ date + log(cap) + ne + ct + log(cum.n) + pt, data = nuclear)
summary(nuclear.lm2)
##
## Call:
## lm(formula = log(cost) ~ date + log(cap) + ne + ct + log(cum.n) +
## pt, data = nuclear)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.32721 -0.07620 0.02920 0.08115 0.28946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.26031 3.13950 -4.224 0.000278 ***
## date 0.21241 0.04326 4.910 4.70e-05 ***
## log(cap) 0.72341 0.11882 6.088 2.31e-06 ***
## ne 0.24902 0.07414 3.359 0.002510 **
## ct 0.14039 0.06042 2.323 0.028582 *
## log(cum.n) -0.08758 0.04147 -2.112 0.044891 *
## pt -0.22610 0.11355 -1.991 0.057490 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1592 on 25 degrees of freedom
## Multiple R-squared: 0.8569, Adjusted R-squared: 0.8225
## F-statistic: 24.95 on 6 and 25 DF, p-value: 2.058e-09
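If you’d rather automate the elimination I left to you above, here is a minimal sketch of one possibility (my own loop, not from SM): repeatedly call drop1() with F tests and remove the term with the largest p-value. I use a 10% cutoff here; the exact threshold, and hence the path, is a judgment call.

fit <- nuclear.lm
repeat {
  tests <- drop1(fit, test = "F")
  p <- tests[["Pr(>F)"]][-1]                    # skip the <none> row
  if (all(p < 0.10, na.rm = TRUE)) break
  worst <- rownames(tests)[-1][which.max(p)]    # least significant term
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}
summary(fit)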
You can compare this to SM Table 8.14.
Cox and Snell note that adding back in the omitted variables (in a different order) doesn’t lead to an improved model. Using step() will pick a model based on the smallest value of \(AIC\):
nuclear.lm2 <- step(nuclear.lm)
## Start: AIC=-107
## log(cost) ~ date + log(t1) + log(t2) + log(cap) + pr + ne + ct +
## bw + log(cum.n) + pt
##
## Df Sum of Sq RSS AIC
## - bw 1 0.00289 0.57091 -108.840
## - log(t1) 1 0.00384 0.57186 -108.787
## - log(t2) 1 0.02961 0.59764 -107.376
## <none> 0.56803 -107.002
## - pr 1 0.03862 0.60665 -106.897
## - log(cum.n) 1 0.08236 0.65038 -104.670
## - ct 1 0.08915 0.65718 -104.337
## - pt 1 0.09073 0.65876 -104.260
## - date 1 0.27800 0.84603 -96.254
## - ne 1 0.30440 0.87242 -95.271
## - log(cap) 1 0.70325 1.27128 -83.223
##
## Step: AIC=-108.84
## log(cost) ~ date + log(t1) + log(t2) + log(cap) + pr + ne + ct +
## log(cum.n) + pt
##
## Df Sum of Sq RSS AIC
## - log(t1) 1 0.00208 0.57300 -110.724
## <none> 0.57091 -108.840
## - log(t2) 1 0.04532 0.61623 -108.396
## - pr 1 0.04610 0.61702 -108.355
## - log(cum.n) 1 0.07959 0.65050 -106.664
## - ct 1 0.08627 0.65718 -106.337
## - pt 1 0.08805 0.65897 -106.250
## - ne 1 0.30504 0.87595 -97.142
## - date 1 0.32231 0.89322 -96.517
## - log(cap) 1 0.70330 1.27421 -85.149
##
## Step: AIC=-110.72
## log(cost) ~ date + log(t2) + log(cap) + pr + ne + ct + log(cum.n) +
## pt
##
## Df Sum of Sq RSS AIC
## <none> 0.57300 -110.724
## - log(t2) 1 0.04354 0.61654 -110.380
## - pr 1 0.04404 0.61703 -110.354
## - ct 1 0.08439 0.65739 -108.327
## - log(cum.n) 1 0.08554 0.65854 -108.271
## - pt 1 0.08842 0.66142 -108.132
## - ne 1 0.30677 0.87977 -99.003
## - date 1 0.66616 1.23915 -88.042
## - log(cap) 1 0.70765 1.28064 -86.988
This model includes log(t2) and pr as well, and we can compare the estimated coefficients and their standard errors:
summary(nuclear.lm2)
##
## Call:
## lm(formula = log(cost) ~ date + log(t2) + log(cap) + pr + ne +
## ct + log(cum.n) + pt, data = nuclear)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29131 -0.09935 0.02178 0.09351 0.24800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.22561 3.37328 -4.514 0.000156 ***
## date 0.22722 0.04394 5.171 3.06e-05 ***
## log(t2) 0.30186 0.22833 1.322 0.199155
## log(cap) 0.68246 0.12805 5.330 2.07e-05 ***
## pr -0.09336 0.07022 -1.330 0.196709
## ne 0.25895 0.07379 3.509 0.001886 **
## ct 0.11462 0.06227 1.841 0.078631 .
## log(cum.n) -0.07873 0.04249 -1.853 0.076751 .
## pt -0.21572 0.11451 -1.884 0.072280 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1578 on 23 degrees of freedom
## Multiple R-squared: 0.8706, Adjusted R-squared: 0.8256
## F-statistic: 19.34 on 8 and 23 DF, p-value: 1.709e-08
This is typical: AIC tends to pick out slightly bigger models than backward elimination based on significance tests.
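For these data we can put the two fits side by side and test the two extra terms directly. Here is a sketch, refitting the hand-selected model under a new name (nuclear.back is my own label; remember that nuclear.lm2 now holds the step() result):

nuclear.back <- lm(log(cost) ~ date + log(cap) + ne + ct + log(cum.n) + pt,
                   data = nuclear)
# estimates and standard errors, matched by coefficient name
back <- summary(nuclear.back)$coefficients
aic  <- summary(nuclear.lm2)$coefficients
shared <- intersect(rownames(aic), rownames(back))
tab <- cbind(back[shared, 1:2], aic[shared, 1:2])
colnames(tab) <- c("est (backward)", "se (backward)", "est (AIC)", "se (AIC)")
round(tab, 3)
# F test for the two extra terms, log(t2) and pr
anova(nuclear.back, nuclear.lm2)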
Another model-building trick is to fit all possible subsets of the covariates; of course we can only do this with relatively few covariates. There is a pretty fast algorithm for this in the leaps package, called regsubsets: see LM-1 p.149 and LM-2 p.154.
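A minimal sketch of what that might look like for these data (the particular call and display are mine; regsubsets itself is in leaps):

library(leaps)
nuclear.all <- regsubsets(log(cost) ~ date + log(t1) + log(t2) + log(cap) +
                            pr + ne + ct + bw + log(cum.n) + pt,
                          data = nuclear, nvmax = 10)
s <- summary(nuclear.all)
# best model of each size, with adjusted R^2 and BIC
cbind(size = 1:10, adjr2 = round(s$adjr2, 3), bic = round(s$bic, 1))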
Finally, Cox & Snell suggest that it would be good to see if the 6 observations with pt \(=1\) have a different relation with cost, which can be assessed with interaction terms pt*\(z\), where \(z\) is one of the other variables retained. We can’t include all the interaction terms in a single model, but we can add them one at a time. For example:
nuclear.lm3 <- update(nuclear.lm2, . ~ . + pt*log(cap))
summary(nuclear.lm3)
##
## Call:
## lm(formula = log(cost) ~ date + log(t2) + log(cap) + pr + ne +
## ct + log(cum.n) + pt + log(cap):pt, data = nuclear)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29357 -0.09831 0.01726 0.09470 0.25058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.05672 3.49188 -4.312 0.000282 ***
## date 0.22540 0.04528 4.977 5.56e-05 ***
## log(t2) 0.29726 0.23357 1.273 0.216417
## log(cap) 0.67870 0.13133 5.168 3.51e-05 ***
## pr -0.09363 0.07167 -1.306 0.204931
## ne 0.25832 0.07534 3.429 0.002401 **
## ct 0.11451 0.06356 1.802 0.085299 .
## log(cum.n) -0.07835 0.04338 -1.806 0.084597 .
## pt -1.90065 5.83497 -0.326 0.747703
## log(cap):pt 0.25054 0.86745 0.289 0.775422
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1611 on 22 degrees of freedom
## Multiple R-squared: 0.8711, Adjusted R-squared: 0.8184
## F-statistic: 16.52 on 9 and 22 DF, p-value: 7.592e-08
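To run the same check for the other retained variables, one possibility is a small loop like the following (the variable list and the printing are my own sketch):

vars <- c("date", "log(t2)", "log(cap)", "pr", "ne", "ct", "log(cum.n)")
for (v in vars) {
  fit <- update(nuclear.lm2, as.formula(paste(". ~ . + pt:", v)))
  cf <- summary(fit)$coefficients
  # the interaction term is the last row of the coefficient table
  cat(v, ": p =", signif(cf[nrow(cf), 4], 2), "\n")
}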
There’s not much change in the other coefficients in nuclear.lm3, so a final model could be nuclear.lm2, or even nuclear.lm3. Residual and diagnostic plots don’t show any surprises.
par(mfrow = c(2,2))
plot(nuclear.lm2)
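If you want numeric diagnostics to go with the plots, one option (my own addition, not part of the analysis above) is the standard influence measures:

summary(influence.measures(nuclear.lm2))   # flags unusually influential cases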