Figure 10.12 in SM plots the data, here’s a try at reproducing the right figure, along with linear, quadratic and cubic regression fits:
toxo.glm1 <- glm(cbind(r,m-r) ~ rain, family = binomial, data = toxo)
toxo.glm2 <- glm(cbind(r,m-r) ~ rain + I(rain^2), family = binomial, data = toxo)
toxo.glm3 <- glm(cbind(r,m-r) ~ rain + I(rain^2) + I(rain^3), family = binomial, data = toxo)
with(toxo, plot(rain, r/m, pch=16, xlab = "Rainfall (mm)", ylab = "proportion positive"))
with(toxo, lines(rain,toxo.glm1$fitted.values))
with(toxo, lines(rain,toxo.glm2$fitted.values, lty=2, col="red"))
with(toxo, lines(rain,toxo.glm3$fitted.values, lty=3, col="blue"))
The linear and quadratic fits are nearly identical:
summary(toxo.glm3)
##
## Call:
## glm(formula = cbind(r, m - r) ~ rain + I(rain^2) + I(rain^3),
## family = binomial, data = toxo)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.762 -1.217 -0.508 0.354 2.620
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -290.1742537061 87.2177635645 -3.33 0.00088 ***
## rain 0.4499754025 0.1346997655 3.34 0.00084 ***
## I(rain^2) -0.0002311282 0.0000690301 -3.35 0.00081 ***
## I(rain^3) 0.0000000393 0.0000000117 3.35 0.00081 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 74.212 on 33 degrees of freedom
## Residual deviance: 62.635 on 30 degrees of freedom
## AIC: 161.3
##
## Number of Fisher Scoring iterations: 3
These coefficients are hard to read – one of Cox & Donelly’s rules of thumb is that estimates should be on a comfortable scale.
succ <- toxo$r
failure <- toxo$m - toxo$r
rainfall <- toxo$rain/100
summary(glm(cbind(succ,failure) ~ rainfall + I(rainfall^2) + I(rainfall^3), family = binomial))
##
## Call:
## glm(formula = cbind(succ, failure) ~ rainfall + I(rainfall^2) +
## I(rainfall^3), family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.762 -1.217 -0.508 0.354 2.620
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -290.1743 87.2178 -3.33 0.00088 ***
## rainfall 44.9975 13.4700 3.34 0.00084 ***
## I(rainfall^2) -2.3113 0.6903 -3.35 0.00081 ***
## I(rainfall^3) 0.0393 0.0117 3.35 0.00081 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 74.212 on 33 degrees of freedom
## Residual deviance: 62.635 on 30 degrees of freedom
## AIC: 161.3
##
## Number of Fisher Scoring iterations: 3
Very strange that all the \(z\)-values are the same! I think this might be because the correlation between rainfall and rainfall\(^2\) and rainfall\(^3\) is really high:
cor(rainfall,rainfall^2); cor(rainfall,rainfall^3)
## [1] 0.9987
## [1] 0.9951
This often happens with observational data. The function poly
will construct orthogonal polynomials so the estimates become uncorrelated. You can often do almost as well by centering the \(x\) variable first.
toxo$rainc <- toxo$rain/100 - mean(toxo$rain/100)
cor(toxo$rainc,toxo$rainc^2); cor(toxo$rainc,toxo$rainc^3)
## [1] 0.4186
## [1] 0.872
toxo.glm4 <- glm(cbind(r,m-r) ~ rainc + I(rainc^2) + I(rainc^3), family = quasibinomial, data = toxo)
toxo.glm5 <- glm(cbind(r,m-r) ~ poly(rain/100, 3), family = quasibinomial, data = toxo)
summary(toxo.glm4)
##
## Call:
## glm(formula = cbind(r, m - r) ~ rainc + I(rainc^2) + I(rainc^3),
## family = quasibinomial, data = toxo)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.762 -1.217 -0.508 0.354 2.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0994 0.1420 0.70 0.489
## rainc -0.2552 0.1230 -2.08 0.047 *
## I(rainc^2) -0.0606 0.0413 -1.47 0.152
## I(rainc^3) 0.0393 0.0163 2.41 0.023 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 1.94)
##
## Null deviance: 74.212 on 33 degrees of freedom
## Residual deviance: 62.635 on 30 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 3
summary(toxo.glm5)
##
## Call:
## glm(formula = cbind(r, m - r) ~ poly(rain/100, 3), family = quasibinomial,
## data = toxo)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.762 -1.217 -0.508 0.354 2.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0243 0.1072 0.23 0.822
## poly(rain/100, 3)1 -0.0861 0.6390 -0.13 0.894
## poly(rain/100, 3)2 -0.1927 0.6511 -0.30 0.769
## poly(rain/100, 3)3 1.3787 0.5732 2.41 0.023 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 1.94)
##
## Null deviance: 74.212 on 33 degrees of freedom
## Residual deviance: 62.635 on 30 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 3
plot(toxo$rain, toxo.glm5$fitted.values, pch="*", col="blue")
points(toxo$rain, toxo$r/toxo$m, col="grey")
There is some evidence of over-dispersion, let’s try:
toxo.quasi <- glm(cbind(succ,failure) ~ rainfall + I(rainfall^2) + I(rainfall^3), family = quasibinomial)
summary(toxo.quasi)
##
## Call:
## glm(formula = cbind(succ, failure) ~ rainfall + I(rainfall^2) +
## I(rainfall^3), family = quasibinomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.762 -1.217 -0.508 0.354 2.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -290.1743 121.4942 -2.39 0.023 *
## rainfall 44.9975 18.7637 2.40 0.023 *
## I(rainfall^2) -2.3113 0.9616 -2.40 0.023 *
## I(rainfall^3) 0.0393 0.0163 2.41 0.023 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 1.94)
##
## Null deviance: 74.212 on 33 degrees of freedom
## Residual deviance: 62.635 on 30 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 3
The point estimates are the same, but the significance has reduced.
Davison goes on to fit local linear and local quadratic regression, using a version of locpoly
for generalized linear models. It is explained on in Example 10.32. I used the gam
package in mgcv
. Using this is an art in itself, and the documentation isn’t terribly helpful. Possibly better at this point to turn to An Introduction to Statistical Learning and follow their examples. The library gam
also has a gam
function, and it will do local polynomial fits using gam.lo
..
toxo.gam <- gam(cbind(r, m-r) ~ s(rain), family = binomial, data = toxo)
summary(toxo.gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## cbind(r, m - r) ~ s(rain)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0901 0.0857 -1.05 0.29
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(rain) 6.51 7.57 22.9 0.0027 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.221 Deviance explained = 36.7%
## UBRE = 0.82365 Scale est. = 1 n = 34
plot(toxo.gam)
points(toxo$rain, toxo$r/toxo$m, col="gray")
This looks wrong, and it is; the reason is in the fine print in the help file. The smooth function s(rain,6.51)
is plotted on the scale of the linear predictor, i.e. on the logit scale, but I plotted the data on the probability scale. Here’s a better plot:
plot(toxo.gam)
with(toxo, points(rain, logit(r/m), col="gray"))
Alternatively, you can plot on the probability scale using the inverse transform:
plot(toxo.gam, seWithMean=TRUE, trans= ilogit, main = "GAM fit, automatic smoothing")
with(toxo, points(rain, r/m, col="gray"))
Here’s a try of a smoother fit, with 3 degrees of freedom
toxo.gam2 <- gam(cbind(r, m-r) ~ s(rain,k=4), family = binomial, data = toxo) # k-1 is the approximate degrees of freedom used
summary(toxo.gam2)
##
## Family: binomial
## Link function: logit
##
## Formula:
## cbind(r, m - r) ~ s(rain, k = 4)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0254 0.0767 0.33 0.74
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(rain) 2.86 2.99 9.21 0.027 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0717 Deviance explained = 14.1%
## UBRE = 1.1018 Scale est. = 1 n = 34
plot(toxo.gam2, seWithMean=TRUE, trans=ilogit, main="GAM fit, 3 df")
with(toxo, points(rain, r/m, col="gray"))
points(toxo$rain, toxo.glm5$fitted.values, pch="*", col="blue" )
The stars are the fitted values from a 3rd degree polynomial, not very different from this cubic spline fit, just a bit more dramatic at the endpoints.