The traditional expression of the linear model for a randomized block design is \[ y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}, \quad j = 1, \dots, B; \quad i = 1, \dots, T. \] Of course this can also be written \(y = X\beta + \epsilon\) for suitable choices of \(X\) and \(\beta\), but the model above is more intuitive. Note that under the assumption that \(\mathbb{E}(\epsilon_{ij})=0\), we have \[ \mathbb{E}(y_{ij}) = \mu + \alpha_i + \beta_j, \] which is our first clue that this model is over-parametrized, because we’ve expressed a single real number as the sum of three other numbers. A fancier way to describe the over-parametrization, under the assumption that the \(\epsilon_{ij}\) are i.i.d. normal with constant variance \(\sigma^2\), is to note that the likelihood function \(\ell(\mu, \alpha_1, \dots, \alpha_T, \beta_1, \dots, \beta_B, \sigma^2; y_{11}, \dots, y_{TB})\) has \(T+B+2\) parameters but only \(T+B\) sufficient statistics. Even without the normal assumption, though, the design matrix \(X\) for the model as written above has the form \[ \begin{pmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{1B} \\ \vdots \\ y_{T1} \\ \vdots \\ y_{TB} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & & & & & & \vdots \\ 1 & 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & 1 \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_1 \\ \vdots \\ \alpha_T \\ \beta_1 \\ \vdots \\ \beta_B \end{pmatrix} + \begin{pmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \vdots \\ \epsilon_{1B} \\ \vdots \\ \epsilon_{T1} \\ \vdots \\ \epsilon_{TB} \end{pmatrix}, \] and this matrix has two redundant columns: the sum of the 2nd through \((T+1)\)st columns is a column of \(1\)'s, and the sum of the last \(B\) columns is also a column of \(1\)'s.
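To see the rank deficiency concretely, here is a small sketch (not from the original notes) that builds the full-dummy design matrix for \(T = 3\) treatments and \(B = 4\) blocks and checks its rank:
d <- expand.grid(treatment = factor(1:3), block = factor(1:4))
Xfull <- cbind(1,
               model.matrix(~ treatment - 1, d),  # T treatment indicator columns
               model.matrix(~ block - 1, d))      # B block indicator columns
ncol(Xfull)      # 1 + T + B = 8 columns
qr(Xfull)$rank   # rank is only T + B - 1 = 6, so two columns are redundant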
When you fit the model in R using `lm`, the \(X\) matrix is automatically made full rank by removing two columns, and the default is to remove the columns corresponding to \(\alpha_1\) and \(\beta_1\). This gives the analysis of variance table and the coefficient estimates in the R Markdown document for the lecture on October 8. This default can be inspected with the `options` command: using the first level of each factor as the baseline (reference) category is called `contr.treatment`, or the treatment contrast. (The second element of the `contrasts` option is used for ordered factors.)
options("contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
library(faraway)   # the oatvar data come from the faraway package
data(oatvar)
twoway <- lm(yield ~ variety + block, data = oatvar)
summary(twoway)
##
## Call:
## lm(formula = yield ~ variety + block, data = oatvar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.275 -19.675 -4.638 17.725 74.925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 363.275 20.027 18.139 < 2e-16 ***
## variety2 42.200 23.125 1.825 0.078707 .
## variety3 28.200 23.125 1.219 0.232841
## variety4 -47.600 23.125 -2.058 0.048968 *
## variety5 105.000 23.125 4.541 9.73e-05 ***
## variety6 -3.800 23.125 -0.164 0.870656
## variety7 -16.000 23.125 -0.692 0.494701
## variety8 49.800 23.125 2.154 0.040035 *
## blockII -25.500 18.282 -1.395 0.174036
## blockIII 0.125 18.282 0.007 0.994593
## blockIV -42.000 18.282 -2.297 0.029282 *
## blockV -77.000 18.282 -4.212 0.000238 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.56 on 28 degrees of freedom
## Multiple R-squared: 0.7477, Adjusted R-squared: 0.6485
## F-statistic: 7.542 on 11 and 28 DF, p-value: 7.723e-06
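Under the treatment contrasts each `variety` coefficient estimates a difference from the baseline variety 1, and in this balanced design those differences are just differences of variety means. A quick check (a sketch, not part of the original code):
vmeans <- with(oatvar, tapply(yield, variety, mean))
vmeans["5"] - vmeans["1"]   # 105.0, matching the variety5 coefficient above
# similarly, the (Intercept), 363.275, is the fitted value for variety 1 in block I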
The coefficient estimates are more intuitive if you use the so-called sum contrasts (\(\sum_i \alpha_i = 0\), \(\sum_j \beta_j = 0\)):
options(contrasts=c("contr.sum","contr.poly"))
twoway2 <- lm(yield ~ variety + block, data = oatvar)
summary(twoway2)
##
## Call:
## lm(formula = yield ~ variety + block, data = oatvar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.275 -19.675 -4.637 17.725 74.925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 354.125 5.781 61.254 < 2e-16 ***
## variety1 -19.725 15.296 -1.290 0.207748
## variety2 22.475 15.296 1.469 0.152880
## variety3 8.475 15.296 0.554 0.583924
## variety4 -67.325 15.296 -4.402 0.000142 ***
## variety5 85.275 15.296 5.575 5.78e-06 ***
## variety6 -23.525 15.296 -1.538 0.135270
## variety7 -35.725 15.296 -2.336 0.026901 *
## block1 28.875 11.562 2.497 0.018666 *
## block2 3.375 11.562 0.292 0.772520
## block3 29.000 11.562 2.508 0.018208 *
## block4 -13.125 11.562 -1.135 0.265941
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.56 on 28 degrees of freedom
## Multiple R-squared: 0.7477, Adjusted R-squared: 0.6485
## F-statistic: 7.542 on 11 and 28 DF, p-value: 7.723e-06
Although the estimated coefficients change, the analysis of variance table is unchanged, and so is the estimated difference between any two varieties (or blocks).
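A quick check of this invariance (a sketch, not in the original notes):
anova(twoway)     # the ANOVA table is the same under either set of contrasts
anova(twoway2)
all.equal(fitted(twoway), fitted(twoway2))              # fitted values are unchanged
coef(twoway2)["variety2"] - coef(twoway2)["variety1"]   # 42.2, the variety2 coefficient under the treatment contrasts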
Under the sum constraints the estimates are \(\hat\mu = \bar y_{..}\), \(\hat\alpha_i = \bar y_{i.}-\bar y_{..}\), and \(\hat\beta_j = \bar y_{.j}-\bar y_{..}\):
ybar <- mean(oatvar$yield)
with(oatvar, tapply(yield,variety,mean)) - ybar
## 1 2 3 4 5 6 7 8
## -19.725 22.475 8.475 -67.325 85.275 -23.525 -35.725 30.075
with(oatvar, tapply(yield, block,mean)) - ybar
## I II III IV V
## 28.875 3.375 29.000 -13.125 -48.125
Note that R lazily doesn’t give you estimates of \(\hat\alpha_8\) and \(\hat\beta_5\), because they can be computed as minus the sum of the other estimates.
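If you want them, the omitted estimates can be recovered from the sum-to-zero constraints, or requested in full with `dummy.coef`; a sketch:
-sum(coef(twoway2)[paste0("variety", 1:7)])   # alpha_8 hat = 30.075
-sum(coef(twoway2)[paste0("block", 1:4)])     # beta_5 hat = -48.125
dummy.coef(twoway2)                           # reports an estimate for every level of every factor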
These estimates are not of interest by themselves, since they depend on the constraints imposed. It doesn’t make any sense to say “variety 4 is significant”, or “variety 4 is significantly different from 0”. The analysis of variance table tells you there is a significant difference between the varieties, but it doesn’t tell you more than this.
One way to provide information about the individual varieties is to construct confidence intervals for the difference between any pair of varieties, or to test the hypothesis that the expected yield of variety \(i\) is equal to the expected yield of variety \(j\) (\(H_0:\alpha_i = \alpha_j\)). There are many ways to do this, and no single way is best in all circumstances; the problem is usually called “multiple comparisons”, and the R package `multcomp` implements many of the standard methods. Ch. 15.4 in FLM-2 addresses pairwise comparisons.
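For a single pre-specified pair of varieties, an unadjusted 95% confidence interval uses the familiar \(t\)-based formula with the pairwise standard error \(\sqrt{2\hat\sigma^2/5} = 23.125\) that appeared in the first summary; a sketch (not part of the original notes):
mse <- deviance(twoway) / df.residual(twoway)   # residual mean square (about 36.56^2)
est <- coef(twoway)["variety2"]                 # variety 2 minus variety 1, 42.2
est + c(-1, 1) * qt(0.975, df.residual(twoway)) * sqrt(2 * mse / 5)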
In the absence of any prior knowledge about the 8 varieties, we might think of comparing every pair of expected values, leading to \({8\choose 2} = 28\) possible confidence intervals or \(p\)-values. There are many ways to adjust either the \(p\)-values or the widths of the confidence intervals for this multiplicity, often described as controlling the “family-wise error rate”, where the “family” in question is the set of all the comparisons. Tukey’s “honest significant difference” is based on the distribution of \(\max_i \bar y_{i.} - \min_i \bar y_{i.}\) when \(\bar y_{1.}, \dots, \bar y_{T.}\) are assumed to be independent normal random variables and an estimate of \(\text{var}(y_{ij})\) is available. The function `TukeyHSD` takes the output of an `aov` fit and returns the comparisons; `aov` is itself a wrapper for `lm` that is suitable for balanced experiments. It gives the same results as `lm` but has a different `summary` method (see `?aov`).
options(contrasts=c("contr.sum","contr.poly")) # in class I left off the `s` on contrasts in the LHS
twoway3 <- aov(yield ~ variety + block, data = oatvar)
allpairs <- TukeyHSD(twoway3, "variety") # gives all 28 confidence intervals, probably too much information!
plot(allpairs)
Because the experiment is completely balanced, the difference between the upper and lower confidence limits is the same for all comparisons. For example
allpairs$variety[1,3] - allpairs$variety[1,2]
## [1] 151.2471
allpairs$variety[2,3] - allpairs$variety[2,2]
## [1] 151.2471
and this is consistent, up to rounding of the MSE, with the formula given in FLM-2 on p. 229, which gives the half-width of the interval: twice the value computed below, about 151.25, reproduces the width found above. (Faraway forgot to take the square root of \(1/J_i + 1/J_j\), in his notation, although in the code just below the formula he did, correctly, use the sqrt; the 1st edition has it done correctly in Sec. 14.4):
qtukey(0.95, 8, 28) * sqrt(1336.9) * sqrt(2/5) / sqrt(2)   # q(0.95; 8 means, 28 df) * sigma-hat * sqrt(1/5 + 1/5) / sqrt(2)
## [1] 75.62344
A different family-wise correction for \(p\)-values is to declare a comparison significant at level 0.05 only if \(p < 0.05/n_{\text{pairs}}\), where \(n_{\text{pairs}}\) is the number of comparisons; in our case 28, and \(0.05/28 \approx 0.0018\). This is called a Bonferroni adjustment. In terms of confidence intervals, the corresponding interval for a pairwise difference has width \(2\, t_{28,\,1-0.0018/2}\, \sqrt{2\hat\sigma^2/5}\), roughly 160, somewhat wider than the Tukey width of about 151 found above; the Bonferroni adjustment is more conservative than Tukey’s procedure for the full set of pairwise comparisons.
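A sketch of that calculation, using the residual mean square and the 28 residual degrees of freedom from the fits above:
mse <- deviance(twoway2) / df.residual(twoway2)      # residual mean square (the 1336.9 used above, up to rounding)
2 * qt(1 - 0.05/28/2, df = 28) * sqrt(2 * mse / 5)   # Bonferroni-adjusted width for a pairwise difference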