Continuing with the prostate example (see also the Sep 15 Rmd file). Here we will look mainly at F-tests, as well as factor variables.
First, model1 is the original linear model with all the covariates:
model1 <- lm(lpsa ~ ., data = prostate)
summary(model1)
##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.733 -0.371 -0.017 0.414 1.638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.66934 1.29639 0.52 0.6069
## lcavol 0.58702 0.08792 6.68 2.1e-09 ***
## lweight 0.45447 0.17001 2.67 0.0090 **
## age -0.01964 0.01117 -1.76 0.0823 .
## lbph 0.10705 0.05845 1.83 0.0704 .
## svi 0.76616 0.24431 3.14 0.0023 **
## lcp -0.10547 0.09101 -1.16 0.2496
## gleason 0.04514 0.15746 0.29 0.7750
## pgg45 0.00453 0.00442 1.02 0.3089
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.708 on 88 degrees of freedom
## Multiple R-squared: 0.655, Adjusted R-squared: 0.623
## F-statistic: 20.9 on 8 and 88 DF, p-value: <2e-16
summary(prostate$gleason)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 6.00 7.00 6.75 7.00 9.00
prostate$gleason
## [1] 6 6 7 6 6 6 6 6 6 6 6 6 7 7 7 6 7 6 6 6 6 7 6 7 6 6 7 7 7 6 6 6 6 6 6 7 8 7
## [39] 7 7 9 7 6 7 7 7 9 7 6 6 7 6 7 7 7 7 7 6 7 7 6 7 9 7 6 7 7 7 6 7 7 7 7 9 7 7
## [77] 7 7 7 7 7 7 7 9 7 7 6 7 7 7 6 7 7 7 7 7 7
We should have done some data exploration first, but speeding along I'll just note that the variable gleason takes just 4 values. I'll make a new variable that treats it as a factor:
prostate2 <- prostate[, -7] # drop the 7th column (gleason)
prostate2$gleason_factor <- as.factor(prostate$gleason)
summary(prostate2$gleason_factor)
## 6 7 8 9
## 35 56 1 5
Note that this is a different type of variable, but you can’t really tell this from the data frame:
head(prostate2)
## lcavol lweight age lbph svi lcp pgg45 lpsa gleason_factor
## 1 -0.5798 2.769 50 -1.386 0 -1.386 0 -0.4308 6
## 2 -0.9943 3.320 58 -1.386 0 -1.386 0 -0.1625 6
## 3 -0.5108 2.691 74 -1.386 0 -1.386 20 -0.1625 7
## 4 -1.2040 3.283 58 -1.386 0 -1.386 0 -0.1625 6
## 5 0.7514 3.432 62 -1.386 0 -1.386 0 0.3716 6
## 6 -1.0498 3.229 50 -1.386 0 -1.386 0 0.7655 6
You can tell, though, with is.factor(prostate2$gleason_factor), for example.
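A quick check along those lines (the output, shown as ## comments, is what these calls return for this data frame):
is.factor(prostate2$gleason_factor)
## [1] TRUE
class(prostate2$gleason_factor)
## [1] "factor"
levels(prostate2$gleason_factor)
## [1] "6" "7" "8" "9"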
model2 <- lm(lpsa ~ ., data = prostate2)
summary(model2)
##
## Call:
## lm(formula = lpsa ~ ., data = prostate2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7482 -0.3504 -0.0263 0.4766 1.7026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.91328 0.84084 1.09 0.2804
## lcavol 0.56999 0.09010 6.33 1.1e-08 ***
## lweight 0.46879 0.16961 2.76 0.0070 **
## age -0.02175 0.01136 -1.91 0.0589 .
## lbph 0.09968 0.05898 1.69 0.0946 .
## svi 0.74588 0.24740 3.01 0.0034 **
## lcp -0.12511 0.09559 -1.31 0.1941
## pgg45 0.00499 0.00467 1.07 0.2885
## gleason_factor7 0.26761 0.21942 1.22 0.2259
## gleason_factor8 0.49682 0.76927 0.65 0.5201
## gleason_factor9 -0.05621 0.50020 -0.11 0.9108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.705 on 86 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.627
## F-statistic: 17.1 on 10 and 86 DF, p-value: <2e-16
model3 <- update(model2, . ~ . - gleason_factor) # same model as model2, but minus gleason_factor
summary(model3)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp +
## pgg45, data = prostate2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7312 -0.3814 -0.0173 0.4336 1.6351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.95393 0.82944 1.15 0.2532
## lcavol 0.59161 0.08600 6.88 8.1e-10 ***
## lweight 0.44829 0.16777 2.67 0.0090 **
## age -0.01934 0.01107 -1.75 0.0840 .
## lbph 0.10767 0.05811 1.85 0.0672 .
## svi 0.75773 0.24128 3.14 0.0023 **
## lcp -0.10448 0.09048 -1.15 0.2513
## pgg45 0.00532 0.00343 1.55 0.1249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.705 on 89 degrees of freedom
## Multiple R-squared: 0.654, Adjusted R-squared: 0.627
## F-statistic: 24.1 on 7 and 89 DF, p-value: <2e-16
Note that we have now used 3 parameters to fit the factor variable; each parameter measures the difference between the baseline level (here "6") and one of the other levels. This is very suspect: there is only 1 observation with a score of 8, and just 5 with a score of 9.
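You can see the coding behind these 3 parameters directly. By default R uses treatment contrasts, with the first level ("6") as the baseline; each column below is the dummy variable for one non-baseline level:
contrasts(prostate2$gleason_factor)
##   7 8 9
## 6 0 0 0
## 7 1 0 0
## 8 0 1 0
## 9 0 0 1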
anova(model2, model3) # anova(model3, model2) looks a little nicer
## Analysis of Variance Table
##
## Model 1: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45 + gleason_factor
## Model 2: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 86 42.7
## 2 89 44.2 -3 -1.48 0.99 0.4
The effect of gleason_factor is assessed by an \(F\)-test, with \(3\) and \(86\) numerator and denominator degrees of freedom, respectively.
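To see where this \(F\)-statistic comes from, here is a minimal sketch that reproduces it from the two residual sums of squares (the numbers match the anova table above):
rss_full <- sum(resid(model2)^2)    # 42.7, on 86 df
rss_reduced <- sum(resid(model3)^2) # 44.2, on 89 df
Fstat <- ((rss_reduced - rss_full) / 3) / (rss_full / 86)
Fstat                                # about 0.99
pf(Fstat, 3, 86, lower.tail = FALSE) # about 0.4, the Pr(>F) above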
This is one example of comparing two models. Another example done in class is less interesting:
anova(model1, lm(lpsa ~ lcavol + lweight + age + svi + lbph, data = prostate))
## Analysis of Variance Table
##
## Model 1: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
## Model 2: lpsa ~ lcavol + lweight + age + svi + lbph
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 88 44.2
## 2 91 45.5 -3 -1.36 0.91 0.44
The omitted variables do not reduce the residual SS very much, but they were chosen based on the output of the first fitted model, so this test is not very useful. Neither is the test that all coefficients are equal to zero (from the summary of model1 above, its p-value is below \(2\times 10^{-16}\)).
Another setting where \(F\)-tests are needed, as opposed to just the summary output from lm, is when fitting "smooth" functions of covariates. As you will recall from "IJALM", these are also linear regression models, even though they're sometimes described as nonparametric.
require("splines")
## Loading required package: splines
ns(prostate$age, 4)[1:10, ] # first 10 rows of the constructed basis (4 columns)
## 1 2 3 4
## [1,] 0.05921 -0.17424 0.39601 -0.22176
## [2,] 0.39904 -0.17586 0.39968 -0.22382
## [3,] 0.04272 0.38670 0.31389 0.25669
## [4,] 0.39904 -0.17586 0.39968 -0.22382
## [5,] 0.73114 -0.07380 0.19165 -0.10732
## [6,] 0.05921 -0.17424 0.39601 -0.22176
## [7,] 0.81980 0.04199 0.09596 -0.05373
## [8,] 0.39904 -0.17586 0.39968 -0.22382
## [9,] 0.01754 -0.12787 0.29060 -0.16274
## [10,] 0.79379 -0.02448 0.13638 -0.07637
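ns() places its interior knots at quantiles of the variable (with df = 4 and no intercept term there are 3 of them). You can inspect the locations; I omit the output here since we haven't looked at the distribution of age:
age_basis <- ns(prostate$age, 4)
attr(age_basis, "knots")          # 3 interior knots, at quantiles of age
attr(age_basis, "Boundary.knots") # by default, the range of age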
model4 <- lm(lpsa ~ . - age + ns(age, 4), data = prostate)
summary(model4)
##
## Call:
## lm(formula = lpsa ~ . - age + ns(age, 4), data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8437 -0.4229 -0.0372 0.4252 1.4976
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.05753 1.26504 0.05 0.9638
## lcavol 0.60394 0.08812 6.85 1.1e-09 ***
## lweight 0.43909 0.17113 2.57 0.0120 *
## lbph 0.11258 0.05852 1.92 0.0577 .
## svi 0.76936 0.24612 3.13 0.0024 **
## lcp -0.12894 0.09138 -1.41 0.1619
## gleason 0.04644 0.16252 0.29 0.7758
## pgg45 0.00547 0.00449 1.22 0.2266
## ns(age, 4)1 -0.96113 0.42946 -2.24 0.0278 *
## ns(age, 4)2 -0.14599 0.36090 -0.40 0.6868
## ns(age, 4)3 -1.08083 1.06296 -1.02 0.3121
## ns(age, 4)4 -1.11145 0.50159 -2.22 0.0294 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.703 on 85 degrees of freedom
## Multiple R-squared: 0.672, Adjusted R-squared: 0.629
## F-statistic: 15.8 on 11 and 85 DF, p-value: 2.58e-16
Note that although some of the ns(age, 4) coefficients have a "*", it doesn't make sense to look at them individually, only as a group:
anova(model4, lm(lpsa ~ . - age, data = prostate))
## Analysis of Variance Table
##
## Model 1: lpsa ~ (lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45) - age + ns(age, 4)
## Model 2: lpsa ~ (lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45) - age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 85 42.0
## 2 89 45.7 -4 -3.73 1.89 0.12
Have a look at model.matrix of model4 to see the 4 extra columns that were added by the ns command.
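For example, a minimal check (the dimensions follow from the summary of model4 above: 85 residual df plus 12 coefficients gives 97 observations):
dim(model.matrix(model4))      # 97 rows, 12 columns
colnames(model.matrix(model4)) # ends with ns(age, 4)1, ..., ns(age, 4)4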
There was a question about the difference between unordered and ordered factors. Ordered factors are treated differently in lm and other methods built on lm: the model matrix includes columns corresponding to linear, quadratic, cubic, etc. orthogonal polynomials in the underlying variable, and these contrast values are computed on the fly. (See the documentation for contrasts.)
is.ordered(prostate2$gleason_factor)
## [1] FALSE
prostate2$go <- as.ordered(prostate2$gleason_factor) # an ordered copy of the factor
head(prostate2)
## lcavol lweight age lbph svi lcp pgg45 lpsa gleason_factor go
## 1 -0.5798 2.769 50 -1.386 0 -1.386 0 -0.4308 6 6
## 2 -0.9943 3.320 58 -1.386 0 -1.386 0 -0.1625 6 6
## 3 -0.5108 2.691 74 -1.386 0 -1.386 20 -0.1625 7 7
## 4 -1.2040 3.283 58 -1.386 0 -1.386 0 -0.1625 6 6
## 5 0.7514 3.432 62 -1.386 0 -1.386 0 0.3716 6 6
## 6 -1.0498 3.229 50 -1.386 0 -1.386 0 0.7655 6 6
model5 <- lm(lpsa ~ . - gleason_factor, data = prostate2) # the ordered factor go enters via the dot
summary(model5)
##
## Call:
## lm(formula = lpsa ~ . - gleason_factor, data = prostate2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7482 -0.3504 -0.0263 0.4766 1.7026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.09034 0.88515 1.23 0.2214
## lcavol 0.56999 0.09010 6.33 1.1e-08 ***
## lweight 0.46879 0.16961 2.76 0.0070 **
## age -0.02175 0.01136 -1.91 0.0589 .
## lbph 0.09968 0.05898 1.69 0.0946 .
## svi 0.74588 0.24740 3.01 0.0034 **
## lcp -0.12511 0.09559 -1.31 0.1941
## pgg45 0.00499 0.00467 1.07 0.2885
## go.L 0.01354 0.36284 0.04 0.9703
## go.Q -0.41032 0.42866 -0.96 0.3411
## go.C -0.16633 0.53519 -0.31 0.7567
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.705 on 86 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.627
## F-statistic: 17.1 on 10 and 86 DF, p-value: <2e-16
X2 <- model.matrix(model5)
X2[1:10,]
## (Intercept) lcavol lweight age lbph svi lcp pgg45 go.L go.Q
## 1 1 -0.5798 2.769 50 -1.3863 0 -1.386 0 -0.6708 0.5
## 2 1 -0.9943 3.320 58 -1.3863 0 -1.386 0 -0.6708 0.5
## 3 1 -0.5108 2.691 74 -1.3863 0 -1.386 20 -0.2236 -0.5
## 4 1 -1.2040 3.283 58 -1.3863 0 -1.386 0 -0.6708 0.5
## 5 1 0.7514 3.432 62 -1.3863 0 -1.386 0 -0.6708 0.5
## 6 1 -1.0498 3.229 50 -1.3863 0 -1.386 0 -0.6708 0.5
## 7 1 0.7372 3.474 64 0.6152 0 -1.386 0 -0.6708 0.5
## 8 1 0.6931 3.539 58 1.5369 0 -1.386 0 -0.6708 0.5
## 9 1 -0.7765 3.539 47 -1.3863 0 -1.386 0 -0.6708 0.5
## 10 1 0.2231 3.244 63 -1.3863 0 -1.386 0 -0.6708 0.5
## go.C
## 1 -0.2236
## 2 -0.2236
## 3 0.6708
## 4 -0.2236
## 5 -0.2236
## 6 -0.2236
## 7 -0.2236
## 8 -0.2236
## 9 -0.2236
## 10 -0.2236
contrasts(prostate2$go)
## .L .Q .C
## [1,] -0.6708 0.5 -0.2236
## [2,] -0.2236 -0.5 0.6708
## [3,] 0.2236 -0.5 -0.6708
## [4,] 0.6708 0.5 0.2236
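These are orthogonal polynomial contrasts; as a quick check (output as ## comments, exact up to rounding), the columns are orthonormal:
round(crossprod(contrasts(prostate2$go)), 4)
##    .L .Q .C
## .L  1  0  0
## .Q  0  1  0
## .C  0  0  1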
Note that the 3 terms for the ordered factor go are now called .L, .Q, .C, for linear, quadratic, and cubic. These could in principle be tested one at a time, because they represent successively more complicated relationships between Gleason score and lpsa. But this data set is not very suitable for illustrating ordered factors, so I'll come back to them when they are more relevant.