Continuing with the prostate example (see also the Sep 15 Rmd file). Here we will look mainly at F-tests, as well as factor variables.

First, model 1 is the original linear model with all the covariates:

model1 <- lm(lpsa ~ ., data = prostate)
summary(model1)
## 
## Call:
## lm(formula = lpsa ~ ., data = prostate)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.733 -0.371 -0.017  0.414  1.638 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.66934    1.29639    0.52   0.6069    
## lcavol       0.58702    0.08792    6.68  2.1e-09 ***
## lweight      0.45447    0.17001    2.67   0.0090 ** 
## age         -0.01964    0.01117   -1.76   0.0823 .  
## lbph         0.10705    0.05845    1.83   0.0704 .  
## svi          0.76616    0.24431    3.14   0.0023 ** 
## lcp         -0.10547    0.09101   -1.16   0.2496    
## gleason      0.04514    0.15746    0.29   0.7750    
## pgg45        0.00453    0.00442    1.02   0.3089    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.708 on 88 degrees of freedom
## Multiple R-squared:  0.655,  Adjusted R-squared:  0.623 
## F-statistic: 20.9 on 8 and 88 DF,  p-value: <2e-16
summary(prostate$gleason)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00    6.00    7.00    6.75    7.00    9.00
prostate$gleason
##  [1] 6 6 7 6 6 6 6 6 6 6 6 6 7 7 7 6 7 6 6 6 6 7 6 7 6 6 7 7 7 6 6 6 6 6 6 7 8 7
## [39] 7 7 9 7 6 7 7 7 9 7 6 6 7 6 7 7 7 7 7 6 7 7 6 7 9 7 6 7 7 7 6 7 7 7 7 9 7 7
## [77] 7 7 7 7 7 7 7 9 7 7 6 7 7 7 6 7 7 7 7 7 7

We should have done some data exploration first, but speeding along I’ll just note that the variable gleason takes just 4 distinct values. I’ll make a new variable that treats it as a factor variable.

prostate2 <- prostate[,-7] # drop the numeric gleason column (column 7)
prostate2$gleason_factor <- as.factor(prostate$gleason)
summary(prostate2$gleason_factor)
##  6  7  8  9 
## 35 56  1  5

Note that gleason_factor is a different type of variable, but you can’t really tell this from printing the data frame:

head(prostate2)
##    lcavol lweight age   lbph svi    lcp pgg45    lpsa gleason_factor
## 1 -0.5798   2.769  50 -1.386   0 -1.386     0 -0.4308              6
## 2 -0.9943   3.320  58 -1.386   0 -1.386     0 -0.1625              6
## 3 -0.5108   2.691  74 -1.386   0 -1.386    20 -0.1625              7
## 4 -1.2040   3.283  58 -1.386   0 -1.386     0 -0.1625              6
## 5  0.7514   3.432  62 -1.386   0 -1.386     0  0.3716              6
## 6 -1.0498   3.229  50 -1.386   0 -1.386     0  0.7655              6

You can tell, though, with is.factor(prostate2$gleason_factor), for example.
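As a quick illustration (with made-up scores, not the prostate data): a factor prints like the original values, but its class and levels reveal the difference.

```r
# Hypothetical scores, just to illustrate factor behaviour
g <- as.factor(c(6, 6, 7, 8, 9, 7))
is.factor(g)                 # TRUE
levels(g)                    # "6" "7" "8" "9" (stored as character labels)
as.numeric(as.character(g))  # 6 6 7 8 9 7, recovering the numeric scores
```

Note that as.numeric(g) alone would return the internal level codes (1, 1, 2, …), not the scores; going through as.character first avoids that common pitfall.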

model2 <- lm(lpsa ~ ., data = prostate2)
summary(model2)
## 
## Call:
## lm(formula = lpsa ~ ., data = prostate2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7482 -0.3504 -0.0263  0.4766  1.7026 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.91328    0.84084    1.09   0.2804    
## lcavol           0.56999    0.09010    6.33  1.1e-08 ***
## lweight          0.46879    0.16961    2.76   0.0070 ** 
## age             -0.02175    0.01136   -1.91   0.0589 .  
## lbph             0.09968    0.05898    1.69   0.0946 .  
## svi              0.74588    0.24740    3.01   0.0034 ** 
## lcp             -0.12511    0.09559   -1.31   0.1941    
## pgg45            0.00499    0.00467    1.07   0.2885    
## gleason_factor7  0.26761    0.21942    1.22   0.2259    
## gleason_factor8  0.49682    0.76927    0.65   0.5201    
## gleason_factor9 -0.05621    0.50020   -0.11   0.9108    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.705 on 86 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.627 
## F-statistic: 17.1 on 10 and 86 DF,  p-value: <2e-16
model3 <- update(model2, . ~ . - gleason_factor) # the same model, minus gleason_factor
summary(model3)
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp + 
##     pgg45, data = prostate2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7312 -0.3814 -0.0173  0.4336  1.6351 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.95393    0.82944    1.15   0.2532    
## lcavol       0.59161    0.08600    6.88  8.1e-10 ***
## lweight      0.44829    0.16777    2.67   0.0090 ** 
## age         -0.01934    0.01107   -1.75   0.0840 .  
## lbph         0.10767    0.05811    1.85   0.0672 .  
## svi          0.75773    0.24128    3.14   0.0023 ** 
## lcp         -0.10448    0.09048   -1.15   0.2513    
## pgg45        0.00532    0.00343    1.55   0.1249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.705 on 89 degrees of freedom
## Multiple R-squared:  0.654,  Adjusted R-squared:  0.627 
## F-statistic: 24.1 on 7 and 89 DF,  p-value: <2e-16

Note that we have now used 3 parameters to fit the factor variable; each parameter measures the difference between the baseline level (here “6”) and one of the other levels. This is somewhat suspect: there is only 1 observation with a score of 8, and only 5 with a score of 9.
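To see where those 3 parameters come from, here is a small sketch (again with made-up scores): model.matrix() expands a factor into 0/1 indicator columns, one per non-baseline level.

```r
# One observation per level, just to show the dummy coding
g <- factor(c(6, 7, 8, 9))
model.matrix(~ g)
# Columns g7, g8, g9 are 0/1 indicators; the baseline level "6"
# is absorbed into the intercept.
```

This is the default treatment coding (contr.treatment); each coefficient in the regression is then the difference between that level’s mean and the baseline’s, all else held fixed.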

anova(model2,model3) # anova(model3,model2) looks a little nicer
## Analysis of Variance Table
## 
## Model 1: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45 + gleason_factor
## Model 2: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1     86 42.7                         
## 2     89 44.2 -3     -1.48 0.99    0.4

The effect of gleason_factor is assessed by an \(F\)-test, with \(3\) and \(86\) numerator and denominator degrees of freedom, respectively.
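This F statistic can be reproduced by hand from the anova table. Using the rounded RSS values printed above, so the result only matches to rounding error:

```r
# Rounded values from the anova table above
rss_full <- 42.7; df_full <- 86   # model2 (with gleason_factor)
rss_red  <- 44.2; df_red  <- 89   # model3 (without it)

# F = (drop in RSS per dropped parameter) / (residual mean square of full model)
F_stat <- ((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)
p_val  <- pf(F_stat, df_red - df_full, df_full, lower.tail = FALSE)
F_stat  # about 1.01; the table shows 0.99 because its RSS values are rounded
```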

This is one example of comparing two models. Another example done in class is less interesting:

anova(model1,lm(lpsa ~ lcavol + lweight + age + svi + lbph, data = prostate ))
## Analysis of Variance Table
## 
## Model 1: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + 
##     pgg45
## Model 2: lpsa ~ lcavol + lweight + age + svi + lbph
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1     88 44.2                         
## 2     91 45.5 -3     -1.36 0.91   0.44

The omitted variables do not reduce the residual SS very much, but they were chosen based on the output of the first fitted model, so this test is not very useful. Neither is the test that all coefficients are equal to zero (it has a p-value of \(10^{-6}\)).

Another setting where \(F\)-tests are needed, as opposed to just the summary output from lm, is when fitting “smooth” functions of covariates. As you will recall from “IJALM”, these are also linear regression models, even though they’re sometimes described as nonparametric.

require("splines")
## Loading required package: splines
ns(prostate$age,4)[1:10,] # first 10 rows of this constructed variable
##             1        2       3        4
##  [1,] 0.05921 -0.17424 0.39601 -0.22176
##  [2,] 0.39904 -0.17586 0.39968 -0.22382
##  [3,] 0.04272  0.38670 0.31389  0.25669
##  [4,] 0.39904 -0.17586 0.39968 -0.22382
##  [5,] 0.73114 -0.07380 0.19165 -0.10732
##  [6,] 0.05921 -0.17424 0.39601 -0.22176
##  [7,] 0.81980  0.04199 0.09596 -0.05373
##  [8,] 0.39904 -0.17586 0.39968 -0.22382
##  [9,] 0.01754 -0.12787 0.29060 -0.16274
## [10,] 0.79379 -0.02448 0.13638 -0.07637
model4 <- lm(lpsa ~ . - age + ns(age, 4), data = prostate)
summary(model4)
## 
## Call:
## lm(formula = lpsa ~ . - age + ns(age, 4), data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8437 -0.4229 -0.0372  0.4252  1.4976 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.05753    1.26504    0.05   0.9638    
## lcavol       0.60394    0.08812    6.85  1.1e-09 ***
## lweight      0.43909    0.17113    2.57   0.0120 *  
## lbph         0.11258    0.05852    1.92   0.0577 .  
## svi          0.76936    0.24612    3.13   0.0024 ** 
## lcp         -0.12894    0.09138   -1.41   0.1619    
## gleason      0.04644    0.16252    0.29   0.7758    
## pgg45        0.00547    0.00449    1.22   0.2266    
## ns(age, 4)1 -0.96113    0.42946   -2.24   0.0278 *  
## ns(age, 4)2 -0.14599    0.36090   -0.40   0.6868    
## ns(age, 4)3 -1.08083    1.06296   -1.02   0.3121    
## ns(age, 4)4 -1.11145    0.50159   -2.22   0.0294 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.703 on 85 degrees of freedom
## Multiple R-squared:  0.672,  Adjusted R-squared:  0.629 
## F-statistic: 15.8 on 11 and 85 DF,  p-value: 2.58e-16

Note that although some of the spline coefficients have a "*", it doesn’t make sense to interpret them individually; they should only be assessed as a group.

anova(model4,lm(lpsa ~ . - age, data = prostate))
## Analysis of Variance Table
## 
## Model 1: lpsa ~ (lcavol + lweight + age + lbph + svi + lcp + gleason + 
##     pgg45) - age + ns(age, 4)
## Model 2: lpsa ~ (lcavol + lweight + age + lbph + svi + lcp + gleason + 
##     pgg45) - age
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1     85 42.0                         
## 2     89 45.7 -4     -3.73 1.89   0.12

Have a look at model.matrix(model4) to see the extra 4 columns that the ns command added.
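The same thing can be seen with a built-in data set (cars here, standing in for prostate): a natural spline with df = 4 contributes 4 basis columns to the model matrix.

```r
library(splines)

# Fit a df = 4 natural spline in speed; the spline expands to 4 columns
fit <- lm(dist ~ ns(speed, 4), data = cars)
colnames(model.matrix(fit))
# "(Intercept)" plus "ns(speed, 4)1" through "ns(speed, 4)4"
```

So "fitting a smooth function of speed" is still ordinary least squares, just on these constructed columns, which is why the group of spline terms can be tested with a standard F-test.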

There was a question about the difference between unordered and ordered factors. Ordered factors are treated differently in lm and other methods built on lm: instead of baseline-contrast indicators, the model matrix contains orthogonal polynomial contrasts, i.e. columns corresponding to linear, quadratic, cubic, and higher-order trends across the ordered levels. These contrast values are computed on the fly by contr.poly. (See the documentation for contrasts.)
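A small sketch with a made-up 4-level ordered factor: contrasts() shows the polynomial contrast columns, which are orthonormal by construction.

```r
# Hypothetical ordered factor with 4 levels (A < B < C < D)
o <- factor(c("A", "B", "C", "D"), ordered = TRUE)
contrasts(o)  # columns .L, .Q, .C: linear, quadratic, cubic contrasts

# Orthonormal columns: crossprod gives (numerically) the identity matrix
round(crossprod(contrasts(o)), 10)
```

Orthogonality means the linear, quadratic, and cubic terms can be estimated without interfering with one another, which is what makes testing them one at a time meaningful.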

is.ordered(prostate2$gleason_factor)
## [1] FALSE
prostate2$go <- as.ordered(prostate2$gleason_factor)
head(prostate2)
##    lcavol lweight age   lbph svi    lcp pgg45    lpsa gleason_factor go
## 1 -0.5798   2.769  50 -1.386   0 -1.386     0 -0.4308              6  6
## 2 -0.9943   3.320  58 -1.386   0 -1.386     0 -0.1625              6  6
## 3 -0.5108   2.691  74 -1.386   0 -1.386    20 -0.1625              7  7
## 4 -1.2040   3.283  58 -1.386   0 -1.386     0 -0.1625              6  6
## 5  0.7514   3.432  62 -1.386   0 -1.386     0  0.3716              6  6
## 6 -1.0498   3.229  50 -1.386   0 -1.386     0  0.7655              6  6
model5 <- lm(lpsa ~ . - gleason_factor, data = prostate2)
summary(model5)
## 
## Call:
## lm(formula = lpsa ~ . - gleason_factor, data = prostate2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7482 -0.3504 -0.0263  0.4766  1.7026 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.09034    0.88515    1.23   0.2214    
## lcavol       0.56999    0.09010    6.33  1.1e-08 ***
## lweight      0.46879    0.16961    2.76   0.0070 ** 
## age         -0.02175    0.01136   -1.91   0.0589 .  
## lbph         0.09968    0.05898    1.69   0.0946 .  
## svi          0.74588    0.24740    3.01   0.0034 ** 
## lcp         -0.12511    0.09559   -1.31   0.1941    
## pgg45        0.00499    0.00467    1.07   0.2885    
## go.L         0.01354    0.36284    0.04   0.9703    
## go.Q        -0.41032    0.42866   -0.96   0.3411    
## go.C        -0.16633    0.53519   -0.31   0.7567    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.705 on 86 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.627 
## F-statistic: 17.1 on 10 and 86 DF,  p-value: <2e-16
X2 <- model.matrix(model5)
X2[1:10,]
##    (Intercept)  lcavol lweight age    lbph svi    lcp pgg45    go.L go.Q
## 1            1 -0.5798   2.769  50 -1.3863   0 -1.386     0 -0.6708  0.5
## 2            1 -0.9943   3.320  58 -1.3863   0 -1.386     0 -0.6708  0.5
## 3            1 -0.5108   2.691  74 -1.3863   0 -1.386    20 -0.2236 -0.5
## 4            1 -1.2040   3.283  58 -1.3863   0 -1.386     0 -0.6708  0.5
## 5            1  0.7514   3.432  62 -1.3863   0 -1.386     0 -0.6708  0.5
## 6            1 -1.0498   3.229  50 -1.3863   0 -1.386     0 -0.6708  0.5
## 7            1  0.7372   3.474  64  0.6152   0 -1.386     0 -0.6708  0.5
## 8            1  0.6931   3.539  58  1.5369   0 -1.386     0 -0.6708  0.5
## 9            1 -0.7765   3.539  47 -1.3863   0 -1.386     0 -0.6708  0.5
## 10           1  0.2231   3.244  63 -1.3863   0 -1.386     0 -0.6708  0.5
##       go.C
## 1  -0.2236
## 2  -0.2236
## 3   0.6708
## 4  -0.2236
## 5  -0.2236
## 6  -0.2236
## 7  -0.2236
## 8  -0.2236
## 9  -0.2236
## 10 -0.2236
contrasts(prostate2$go)
##           .L   .Q      .C
## [1,] -0.6708  0.5 -0.2236
## [2,] -0.2236 -0.5  0.6708
## [3,]  0.2236 -0.5 -0.6708
## [4,]  0.6708  0.5  0.2236

Note that the 3 terms for the ordered factor go are now called go.L, go.Q, go.C, for linear, quadratic, cubic. These could in principle be tested one at a time, because they represent successively more complicated relationships between Gleason score and lpsa. But these data are not very suitable for illustrating ordered factors, so I’ll come back to this when it is more relevant.