I’m sure there are nicer ways to set things up, but I’m using a default version, except I added scipen
, to force the output not to use exponential notation, which I hate.
Faraway asks first to fit a model with all interactions, so here it is:
data(esoph)
head(esoph)
## agegp alcgp tobgp ncases ncontrols
## 1 25-34 0-39g/day 0-9g/day 0 40
## 2 25-34 0-39g/day 10-19 0 10
## 3 25-34 0-39g/day 20-29 0 6
## 4 25-34 0-39g/day 30+ 0 5
## 5 25-34 40-79 0-9g/day 0 27
## 6 25-34 40-79 10-19 0 7
help(esoph)
xtabs(ncases ~ agegp+tobgp, data = esoph)
## tobgp
## agegp 0-9g/day 10-19 20-29 30+
## 25-34 0 1 0 0
## 35-44 2 4 3 0
## 45-54 14 13 8 11
## 55-64 25 23 12 16
## 65-74 31 12 10 2
## 75+ 6 5 0 2
There could be more elegant ways to tabulate the data, but this is at least feasible to look at. In the help file for xtabs
this dataset is used as an example. And I learned about “flat tables”.
ftable(xtabs(cbind(ncases,ncontrols)~ . , data=esoph))
## ncases ncontrols
## agegp alcgp tobgp
## 25-34 0-39g/day 0-9g/day 0 40
## 10-19 0 10
## 20-29 0 6
## 30+ 0 5
## 40-79 0-9g/day 0 27
## 10-19 0 7
## 20-29 0 4
## 30+ 0 7
## 80-119 0-9g/day 0 2
## 10-19 0 1
## 20-29 0 0
## 30+ 0 2
## 120+ 0-9g/day 0 1
## 10-19 1 1
## 20-29 0 1
## 30+ 0 2
## 35-44 0-39g/day 0-9g/day 0 60
## 10-19 1 14
## 20-29 0 7
## 30+ 0 8
## 40-79 0-9g/day 0 35
## 10-19 3 23
## 20-29 1 14
## 30+ 0 8
## 80-119 0-9g/day 0 11
## 10-19 0 6
## 20-29 0 2
## 30+ 0 1
## 120+ 0-9g/day 2 3
## 10-19 0 3
## 20-29 2 4
## 30+ 0 0
## 45-54 0-39g/day 0-9g/day 1 46
## 10-19 0 18
## 20-29 0 10
## 30+ 0 4
## 40-79 0-9g/day 6 38
## 10-19 4 21
## 20-29 5 15
## 30+ 5 7
## 80-119 0-9g/day 3 16
## 10-19 6 14
## 20-29 1 5
## 30+ 2 4
## 120+ 0-9g/day 4 4
## 10-19 3 4
## 20-29 2 3
## 30+ 4 4
## 55-64 0-39g/day 0-9g/day 2 49
## 10-19 3 22
## 20-29 3 12
## 30+ 4 6
## 40-79 0-9g/day 9 40
## 10-19 6 21
## 20-29 4 17
## 30+ 3 6
## 80-119 0-9g/day 9 18
## 10-19 8 15
## 20-29 3 6
## 30+ 4 4
## 120+ 0-9g/day 5 10
## 10-19 6 7
## 20-29 2 3
## 30+ 5 6
## 65-74 0-39g/day 0-9g/day 5 48
## 10-19 4 14
## 20-29 2 7
## 30+ 0 2
## 40-79 0-9g/day 17 34
## 10-19 3 10
## 20-29 5 9
## 30+ 0 0
## 80-119 0-9g/day 6 13
## 10-19 4 12
## 20-29 2 3
## 30+ 1 1
## 120+ 0-9g/day 3 4
## 10-19 1 2
## 20-29 1 1
## 30+ 1 1
## 75+ 0-39g/day 0-9g/day 1 18
## 10-19 2 6
## 20-29 0 0
## 30+ 1 3
## 40-79 0-9g/day 2 5
## 10-19 1 3
## 20-29 0 3
## 30+ 1 1
## 80-119 0-9g/day 1 1
## 10-19 1 1
## 20-29 0 0
## 30+ 0 0
## 120+ 0-9g/day 2 2
## 10-19 1 1
## 20-29 0 0
## 30+ 0 0
ftable(xtabs(cbind(ncases,ncontrols) ~ agegp, data = esoph))
## ncases ncontrols
## agegp
## 25-34 1 116
## 35-44 9 199
## 45-54 46 213
## 55-64 76 242
## 65-74 55 161
## 75+ 13 44
ftable(xtabs(cbind(ncases,ncontrols) ~ agegp + tobgp, data = esoph))
## ncases ncontrols
## agegp tobgp
## 25-34 0-9g/day 0 70
## 10-19 1 19
## 20-29 0 11
## 30+ 0 16
## 35-44 0-9g/day 2 109
## 10-19 4 46
## 20-29 3 27
## 30+ 0 17
## 45-54 0-9g/day 14 104
## 10-19 13 57
## 20-29 8 33
## 30+ 11 19
## 55-64 0-9g/day 25 117
## 10-19 23 65
## 20-29 12 38
## 30+ 16 22
## 65-74 0-9g/day 31 99
## 10-19 12 38
## 20-29 10 20
## 30+ 2 4
## 75+ 0-9g/day 6 26
## 10-19 5 11
## 20-29 0 3
## 30+ 2 4
And now I’ll try the full model as requeseted in (a):
farmod1 <- glm(cbind(ncases, ncontrols) ~ agegp * tobgp * alcgp, data = esoph, family = binomial)
summary(farmod1)
##
## Call:
## glm(formula = cbind(ncases, ncontrols) ~ agegp * tobgp * alcgp,
## family = binomial, data = esoph)
##
## Deviance Residuals:
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [26] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [51] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [76] 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## Coefficients: (8 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.472 448746.504 0 1
## agegp.L -15.569 1618575.986 0 1
## agegp.Q -40.252 1484227.866 0 1
## agegp.C -14.684 1008976.267 0 1
## agegp^4 -12.303 547448.216 0 1
## agegp^5 0.127 195587.489 0 1
## tobgp.L -19.546 994718.397 0 1
## tobgp.Q -13.090 508437.111 0 1
## tobgp.C -3.141 272907.496 0 1
## alcgp.L -23.834 910490.633 0 1
## alcgp.Q -13.667 537896.363 0 1
## alcgp.C -7.679 446814.589 0 1
## agegp.L:tobgp.L -93.887 3715371.184 0 1
## agegp.Q:tobgp.L -77.269 3347188.027 0 1
## agegp.C:tobgp.L -33.534 2229132.399 0 1
## agegp^4:tobgp.L -41.093 1340669.776 0 1
## agegp^5:tobgp.L 5.377 401101.346 0 1
## agegp.L:tobgp.Q -58.813 2176812.828 0 1
## agegp.Q:tobgp.Q -53.162 1889690.079 0 1
## agegp.C:tobgp.Q -20.832 1093745.212 0 1
## agegp^4:tobgp.Q -22.200 913350.652 0 1
## agegp^5:tobgp.Q -1.313 162562.540 0 1
## agegp.L:tobgp.C -25.432 1209684.626 0 1
## agegp.Q:tobgp.C -14.233 1051280.542 0 1
## agegp.C:tobgp.C -7.456 533120.604 0 1
## agegp^4:tobgp.C -9.394 538600.635 0 1
## agegp^5:tobgp.C 2.459 40537.146 0 1
## agegp.L:alcgp.L -139.374 3510599.248 0 1
## agegp.Q:alcgp.L -121.145 3133477.378 0 1
## agegp.C:alcgp.L -59.010 2001861.710 0 1
## agegp^4:alcgp.L -56.014 1313419.645 0 1
## agegp^5:alcgp.L -12.241 337434.120 0 1
## agegp.L:alcgp.Q -73.145 2344368.743 0 1
## agegp.Q:alcgp.Q -52.208 2045544.146 0 1
## agegp.C:alcgp.Q -26.446 1141314.446 0 1
## agegp^4:alcgp.Q -37.422 969777.570 0 1
## agegp^5:alcgp.Q 9.039 149111.766 0 1
## agegp.L:alcgp.C -52.932 1799046.068 0 1
## agegp.Q:alcgp.C -43.309 1599967.759 0 1
## agegp.C:alcgp.C -15.277 924318.738 0 1
## agegp^4:alcgp.C -23.491 669602.012 0 1
## agegp^5:alcgp.C -1.180 109910.517 0 1
## tobgp.L:alcgp.L -55.524 2111355.720 0 1
## tobgp.Q:alcgp.L -10.529 1190467.469 0 1
## tobgp.C:alcgp.L 13.875 451398.285 0 1
## tobgp.L:alcgp.Q -41.199 1270833.457 0 1
## tobgp.Q:alcgp.Q -31.930 777774.355 0 1
## tobgp.C:alcgp.Q -17.179 399500.665 0 1
## tobgp.L:alcgp.C -22.339 995093.905 0 1
## tobgp.Q:alcgp.C -12.308 415519.343 0 1
## tobgp.C:alcgp.C 0.424 26466.259 0 1
## agegp.L:tobgp.L:alcgp.L -284.038 8518464.949 0 1
## agegp.Q:tobgp.L:alcgp.L -255.110 7548330.307 0 1
## agegp.C:tobgp.L:alcgp.L -119.080 4618168.407 0 1
## agegp^4:tobgp.L:alcgp.L -135.525 3302207.688 0 1
## agegp^5:tobgp.L:alcgp.L -11.605 821017.435 0 1
## agegp.L:tobgp.Q:alcgp.L -103.297 5251283.868 0 1
## agegp.Q:tobgp.Q:alcgp.L -86.885 4573489.954 0 1
## agegp.C:tobgp.Q:alcgp.L -13.858 2546319.838 0 1
## agegp^4:tobgp.Q:alcgp.L -83.605 2211972.272 0 1
## agegp^5:tobgp.Q:alcgp.L 19.694 451884.902 0 1
## agegp.L:tobgp.C:alcgp.L 28.810 2178231.735 0 1
## agegp.Q:tobgp.C:alcgp.L 37.313 1871275.690 0 1
## agegp.C:tobgp.C:alcgp.L 18.220 904268.476 0 1
## agegp^4:tobgp.C:alcgp.L 11.687 994993.359 0 1
## agegp^5:tobgp.C:alcgp.L NA NA NA NA
## agegp.L:tobgp.L:alcgp.Q -193.300 5799765.283 0 1
## agegp.Q:tobgp.L:alcgp.Q -161.854 5002186.199 0 1
## agegp.C:tobgp.L:alcgp.Q -77.169 2660725.551 0 1
## agegp^4:tobgp.L:alcgp.Q -88.452 2509144.512 0 1
## agegp^5:tobgp.L:alcgp.Q 7.564 263437.728 0 1
## agegp.L:tobgp.Q:alcgp.Q -163.091 3876596.317 0 1
## agegp.Q:tobgp.Q:alcgp.Q -144.727 3275262.932 0 1
## agegp.C:tobgp.Q:alcgp.Q -53.939 1605508.369 0 1
## agegp^4:tobgp.Q:alcgp.Q -74.727 1773263.500 0 1
## agegp^5:tobgp.Q:alcgp.Q NA NA NA NA
## agegp.L:tobgp.C:alcgp.Q -86.204 1832255.441 0 1
## agegp.Q:tobgp.C:alcgp.Q -64.657 1511051.950 0 1
## agegp.C:tobgp.C:alcgp.Q -39.782 891255.821 0 1
## agegp^4:tobgp.C:alcgp.Q -33.337 803627.917 0 1
## agegp^5:tobgp.C:alcgp.Q NA NA NA NA
## agegp.L:tobgp.L:alcgp.C -108.648 4072433.764 0 1
## agegp.Q:tobgp.L:alcgp.C -93.993 3572958.455 0 1
## agegp.C:tobgp.L:alcgp.C -44.130 2055374.262 0 1
## agegp^4:tobgp.L:alcgp.C -44.701 1524151.088 0 1
## agegp^5:tobgp.L:alcgp.C -3.678 228037.948 0 1
## agegp.L:tobgp.Q:alcgp.C -44.180 1984146.123 0 1
## agegp.Q:tobgp.Q:alcgp.C -42.372 1645734.736 0 1
## agegp.C:tobgp.Q:alcgp.C -22.024 802652.993 0 1
## agegp^4:tobgp.Q:alcgp.C -19.893 817172.909 0 1
## agegp^5:tobgp.Q:alcgp.C NA NA NA NA
## agegp.L:tobgp.C:alcgp.C -2.805 221432.610 0 1
## agegp.Q:tobgp.C:alcgp.C NA NA NA NA
## agegp.C:tobgp.C:alcgp.C NA NA NA NA
## agegp^4:tobgp.C:alcgp.C NA NA NA NA
## agegp^5:tobgp.C:alcgp.C NA NA NA NA
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 227.24055853245858 on 87 degrees of freedom
## Residual deviance: 0.00000000030116 on 0 degrees of freedom
## AIC: 323.5
##
## Number of Fisher Scoring iterations: 25
So we probably didn’t really want this. We’ve fit a parameter to every observation, not helpful. What is with all the .L, .C etc?
is.ordered(esoph$agegp)
## [1] TRUE
Ordered factors use polynomial contrasts to fit the various levels. This is a kind of compromise between treating the values as continuous and treating them as completely unordered. In the help
file for this dataset you’ll see models with covariates unclass(tobgp)
and unclass(alcgp)
.
Now Faraway suggests using backward interaction to simplify the model “as far as is reasonable”.
farmod2 <- step(farmod1, direction = "backward")
## Start: AIC=323.5
## cbind(ncases, ncontrols) ~ agegp * tobgp * alcgp
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - agegp:tobgp:alcgp 37 16.1 266
## <none> 0.0 323
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=265.6
## cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp + agegp:tobgp +
## agegp:alcgp + tobgp:alcgp
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - agegp:tobgp 15 27.1 247
## - agegp:alcgp 15 34.4 254
## - tobgp:alcgp 9 23.8 255
## <none> 16.1 266
##
## Step: AIC=246.6
## cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp + agegp:alcgp +
## tobgp:alcgp
##
## Df Deviance AIC
## - tobgp:alcgp 9 33.8 235
## - agegp:alcgp 15 47.5 237
## <none> 27.1 247
##
## Step: AIC=235.3
## cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp + agegp:alcgp
##
## Df Deviance AIC
## - agegp:alcgp 15 54.0 226
## <none> 33.8 235
## - tobgp 3 44.2 240
##
## Step: AIC=225.4
## cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp
##
## Df Deviance AIC
## <none> 54.0 226
## - tobgp 3 64.6 230
## - alcgp 3 120.0 286
## - agegp 5 131.5 293
summary(farmod2)
##
## Call:
## glm(formula = cbind(ncases, ncontrols) ~ agegp + tobgp + alcgp,
## family = binomial, data = esoph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.689 -0.562 -0.217 0.231 2.064
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7800 0.1980 -8.99 < 2e-16 ***
## agegp.L 3.0053 0.6522 4.61 0.000004058789115 ***
## agegp.Q -1.3379 0.5911 -2.26 0.0236 *
## agegp.C 0.1531 0.4485 0.34 0.7329
## agegp^4 0.0641 0.3088 0.21 0.8356
## agegp^5 -0.1936 0.1954 -0.99 0.3216
## tobgp.L 0.5945 0.1942 3.06 0.0022 **
## tobgp.Q 0.0654 0.1881 0.35 0.7282
## tobgp.C 0.1568 0.1866 0.84 0.4007
## alcgp.L 1.4919 0.1994 7.48 0.000000000000072 ***
## alcgp.Q -0.2266 0.1795 -1.26 0.2068
## alcgp.C 0.2546 0.1591 1.60 0.1094
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 227.241 on 87 degrees of freedom
## Residual deviance: 53.973 on 76 degrees of freedom
## AIC: 225.5
##
## Number of Fisher Scoring iterations: 6
This eliminated all the interactions, but still used orthogonal polynomials for the variables. And up to 5th degree for age! Doesn’t qutie make sense. We could either treat the variables as continuous, or as unordered factors.
farmod3 <- glm(cbind(ncases, ncontrols) ~ unclass(agegp) + unclass(alcgp) + unclass(tobgp), family = binomial, data = esoph)
summary(farmod3)
##
## Call:
## glm(formula = cbind(ncases, ncontrols) ~ unclass(agegp) + unclass(alcgp) +
## unclass(tobgp), family = binomial, data = esoph)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.959 -0.856 -0.436 0.307 1.935
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.5959 0.4154 -13.47 < 2e-16 ***
## unclass(agegp) 0.5287 0.0719 7.36 0.00000000000019 ***
## unclass(alcgp) 0.6938 0.0834 8.32 < 2e-16 ***
## unclass(tobgp) 0.2745 0.0807 3.40 0.00068 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 227.241 on 87 degrees of freedom
## Residual deviance: 73.959 on 84 degrees of freedom
## AIC: 229.4
##
## Number of Fisher Scoring iterations: 4
This seems a bit too simple, treating each covariate as continuous and just fitting a linear model. Myabe we could try linear in age, but with tobacco and alcohol as factors. Below are several versions, they can be compared using anova
, or their residual deviances can be compared.
esoph2 <- mutate(esoph,age = factor(agegp,ordered=FALSE), tob = factor(tobgp, ordered = F), alc = factor(alcgp, ordered = F) )
# new dataset has unordered factors
farmod4 <- glm(cbind(ncases, ncontrols) ~ unclass(agegp) + tob*alc, data = esoph2, family = binomial)
farmod5 <- glm(cbind(ncases, ncontrols) ~ unclass(agegp) + tob + alc, data = esoph2, family = binomial)
farmod6 <- glm(cbind(ncases, ncontrols) ~ agegp + tob + alc, data = esoph2, family = binomial)
This example in Chapter 3 of FELM is used to introduce Poisson regression. Faraway first considers ordinary linear regression, and notes that the residuals don’t seem to have constant variance. Back in the day we’d take logs or square root of the response and then fit linear regression again, but a more modern approach is to fit a Poisson glm.
The residual deviance in a Poisson model is a test of fit of the model, as it is with a Binomial. The second fit below allows an inflation factor on the variance, while still using the simplicity of the Poisson for fitting.
data(gala)
head(gala)
## Species Endemics Area Elevation Nearest Scruz Adjacent
## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84
help(gala)
gal.glm <- glm(Species ~ . - Endemics,
family = poisson,
data = gala)
summary(gal.glm)
##
## Call:
## glm(formula = Species ~ . - Endemics, family = poisson, data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.275 -4.497 -0.944 1.917 10.185
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1548079 0.0517495 60.96 < 2e-16 ***
## Area -0.0005799 0.0000263 -22.07 < 2e-16 ***
## Elevation 0.0035406 0.0000874 40.51 < 2e-16 ***
## Nearest 0.0088256 0.0018213 4.85 0.0000013 ***
## Scruz -0.0057094 0.0006256 -9.13 < 2e-16 ***
## Adjacent -0.0006630 0.0000293 -22.61 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 716.85 on 24 degrees of freedom
## AIC: 889.7
##
## Number of Fisher Scoring iterations: 5
gal.quasi <- glm(Species ~ . - Endemics,
family = quasipoisson,
data = gala)
summary(gal.quasi)
##
## Call:
## glm(formula = Species ~ . - Endemics, family = quasipoisson,
## data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.275 -4.497 -0.944 1.917 10.185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.154808 0.291590 10.82 0.0000000001 ***
## Area -0.000580 0.000148 -3.92 0.00065 ***
## Elevation 0.003541 0.000493 7.19 0.0000001983 ***
## Nearest 0.008826 0.010262 0.86 0.39829
## Scruz -0.005709 0.003525 -1.62 0.11838
## Adjacent -0.000663 0.000165 -4.01 0.00051 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 31.75)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 716.85 on 24 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
plot(gal.quasi$linear.predictors, gal.quasi$residuals)
The canonical link for the Poisson is the log link. Identity links are sometimes used as well; see the
cloth
data analysis in SM Example 10.28.
data(cloth)
head(cloth)
## x y
## 1 1.22 1
## 2 1.70 4
## 3 2.71 5
## 4 3.71 14
## 5 3.72 7
## 6 3.75 9
with(cloth, plot(x,y)) # Figure 10.11
clothmod1 <- glm(y ~ x - 1, family = poisson(link = identity), data = cloth)
summary(clothmod1)
##
## Call:
## glm(formula = y ~ x - 1, family = poisson(link = identity), data = cloth)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.817 -1.127 -0.237 0.627 3.438
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## x 1.5102 0.0896 16.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: Inf on 32 degrees of freedom
## Residual deviance: 64.537 on 31 degrees of freedom
## AIC: 189.8
##
## Number of Fisher Scoring iterations: 3
clothmod2 <- glm(y ~ x - 1, family = quasipoisson(link=identity), data = cloth )
summary(clothmod2)
##
## Call:
## glm(formula = y ~ x - 1, family = quasipoisson(link = identity),
## data = cloth)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.817 -1.127 -0.237 0.627 3.438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.510 0.133 11.4 0.0000000000014 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.194)
##
## Null deviance: Inf on 32 degrees of freedom
## Residual deviance: 64.537 on 31 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 3
plot(clothmod2)
In SM (p. 511) the suggestion is made that a route to an overdispersed Poisson would be to model \(Y\) as Poisson with mean \(\mu\epsilon\), where \(\epsilon\) is a latent random variable with expected value 1. If a gamma distribution with shape \(\nu\) and mean 1 is postulated, then the marginal distribution of \(Y\) is negative binomial. The negative binomial model can be fit by maximum likelihood using
MASS::glm.nb
; see FELM 3.3.
I will use the chimp
data set analysed in SM, Example 10.16.
data(chimps)
head(chimps)
## chimp word y
## 1 1 1 178
## 2 1 2 60
## 3 1 3 177
## 4 1 4 36
## 5 1 5 225
## 6 1 6 345
The response is time to learn each of 10 words, in minutes, and there are four chimpanzees. See p.485 of SM where the data are set out in a randomized block format. Davison’s first analysis is a simple lm(y ~ chimp + word, data = chimps)
, and then he applies Tukey’s 1 df test for non-additivity (see Example 8.24) to confirm that the additive model is not a good fit.
model1 <- glm(y ~ chimp + word, family=Gamma(link = "log"), data = chimps)
options(digits=6)
anova(model1)
## Analysis of Deviance Table
##
## Model: Gamma, link: log
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 39 60.38
## chimp 3 6.95 36 53.43
## word 9 38.46 27 14.97
This reproduces the first half of Table 10.6. The other side is the same thing with the factors introduced in the opposite order.
model2 <- glm(y ~ word + chimp, family = Gamma(link = "log"), data = chimps)
summary(model2)
##
## Call:
## glm(formula = y ~ word + chimp, family = Gamma(link = "log"),
## data = chimps)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6766 -0.4510 -0.0488 0.3074 1.2847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.458 0.375 14.54 0.000000000000027 ***
## word2 -1.771 0.466 -3.80 0.00074 ***
## word3 -0.365 0.466 -0.78 0.44007
## word4 -1.821 0.466 -3.91 0.00056 ***
## word5 -1.102 0.466 -2.37 0.02541 *
## word6 0.279 0.466 0.60 0.55461
## word7 -1.725 0.466 -3.70 0.00096 ***
## word8 -2.555 0.466 -5.49 0.000008278115332 ***
## word9 0.811 0.466 1.74 0.09301 .
## word10 -0.514 0.466 -1.10 0.27972
## chimp2 -0.973 0.295 -3.31 0.00269 **
## chimp3 -0.808 0.295 -2.74 0.01068 *
## chimp4 -0.113 0.295 -0.38 0.70479
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.433666)
##
## Null deviance: 60.378 on 39 degrees of freedom
## Residual deviance: 14.972 on 27 degrees of freedom
## AIC: 418.4
##
## Number of Fisher Scoring iterations: 12
anova(model2)
## Analysis of Deviance Table
##
## Model: Gamma, link: log
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 39 60.38
## word 9 39.19 30 21.19
## chimp 3 6.22 27 14.97
It makes a bit of difference, but not substantial. To test for a chimp
effect we need an estimate of the dispersion parameter, which is obtained from summary
as 0.434 (Davison reports .432), needs checking. Perhaps summary
used deviance instead of Pearson \(X^2\)?)
The chimp
test statistic is \((6.22/3)/0.434 = 4.77\) which gives a \(p\)-value of \(0.008\) when referred to an \(F_{3, 27}\) distribution. The word
test statistic is \((39.19/9)/0.434\) about \(10^{-6}\).
test1 <- (6.22/3)/0.434
pf(test1, 3, 27, lower.tail = F)
## [1] 0.00849023
test2 <- (38.46/9)/0.434
pf(test2, 9, 27, lower.tail = F)
## [1] 0.00000166957
Davison suggests two other approaches to this data: an ordinary normal theory model fit to \(\log(y)\), and a gamma model with the canonical link \(1/\mu\). I’ll let you explore those on your own.