A note about contrasts in designed experiments

The traditional expression of the linear model for a randomized block design is \[ y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}, \quad j = 1, \dots, B; \quad i = 1, \dots, T. \] Of course this can also be written \(y = X\beta + \epsilon\) for suitable choices of \(X\) and \(\beta\), but the model above is more intuitive. Note that under the assumption that \(\mathbb{E}(\epsilon_{ij})=0\), we have \[ \mathbb{E}(y_{ij}) = \mu + \alpha_i + \beta_j, \] which is our first clue that this model is over-parametrized, because we've expressed a single real number as the sum of three other numbers. A fancier way to describe the over-parametrization, under the assumption that the \(\epsilon_{ij}\) are i.i.d. normal with constant variance \(\sigma^2\), is to note that the likelihood function \(\ell(\mu, \alpha_1, \dots, \alpha_T, \beta_1, \dots, \beta_B, \sigma^2; y_{11}, \dots, y_{TB})\) has \(T+B+2\) parameters, but only \(T+B\) sufficient statistics. Even without the normal assumption, though, the model as written above takes the matrix form \[ \begin{pmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{TB} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & & & & & & \vdots \\ 1 & 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & 1 \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_1 \\ \vdots \\ \alpha_T \\ \beta_1 \\ \vdots \\ \beta_B \end{pmatrix} + \begin{pmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \vdots \\ \epsilon_{TB} \end{pmatrix}, \] and this design matrix \(X\) has two redundant columns: the sum of the 2nd through \((T+1)\)st columns is a column of \(1\)'s, and so is the sum of the last \(B\) columns.
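As a quick numerical check, we can build this over-parametrized \(X\) by hand and confirm the rank deficiency, using the oatvar data (8 varieties, 5 blocks) from the faraway package that is analysed below:

library(faraway)                      # provides the oatvar data
X <- cbind(1, model.matrix(~ variety - 1, oatvar),
           model.matrix(~ block - 1, oatvar))
dim(X)      # 14 columns: mu, 8 variety indicators, 5 block indicators
## [1] 40 14
qr(X)$rank  # rank 12, so two columns are redundant
## [1] 12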

When you fit the model in R using lm, the \(X\) matrix is automatically made full rank by removing two columns; the default is to remove the columns corresponding to \(\alpha_1\) and \(\beta_1\). This gives the analysis of variance table and estimates of coefficients in the R Markdown document for the lecture on October 8. This default can be inspected with the options command: coding a factor so that its first level is the baseline, or reference, category is called contr.treatment, the treatment contrast. (The second element of the contrasts option applies to ordered factors.)

options("contrasts")
## $contrasts
##         unordered           ordered 
## "contr.treatment"      "contr.poly"
library(faraway) # the oatvar data is from Faraway's package
data(oatvar)
twoway <- lm(yield ~ variety + block, data = oatvar)
summary(twoway)
## 
## Call:
## lm(formula = yield ~ variety + block, data = oatvar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.275 -19.675  -4.638  17.725  74.925 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  363.275     20.027  18.139  < 2e-16 ***
## variety2      42.200     23.125   1.825 0.078707 .  
## variety3      28.200     23.125   1.219 0.232841    
## variety4     -47.600     23.125  -2.058 0.048968 *  
## variety5     105.000     23.125   4.541 9.73e-05 ***
## variety6      -3.800     23.125  -0.164 0.870656    
## variety7     -16.000     23.125  -0.692 0.494701    
## variety8      49.800     23.125   2.154 0.040035 *  
## blockII      -25.500     18.282  -1.395 0.174036    
## blockIII       0.125     18.282   0.007 0.994593    
## blockIV      -42.000     18.282  -2.297 0.029282 *  
## blockV       -77.000     18.282  -4.212 0.000238 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.56 on 28 degrees of freedom
## Multiple R-squared:  0.7477, Adjusted R-squared:  0.6485 
## F-statistic: 7.542 on 11 and 28 DF,  p-value: 7.723e-06
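
You can confirm which two columns were dropped by inspecting the fitted model matrix:

colnames(model.matrix(twoway)) # no "variety1" or "blockI" column appears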

The coefficient estimates are more intuitive if you use the so-called sum contrasts (\(\sum_i \alpha_i = 0\), \(\sum_j \beta_j = 0\)):

options(contrasts=c("contr.sum","contr.poly"))
twoway2 <- lm(yield ~ variety + block, data = oatvar)
summary(twoway2)
## 
## Call:
## lm(formula = yield ~ variety + block, data = oatvar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.275 -19.675  -4.637  17.725  74.925 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  354.125      5.781  61.254  < 2e-16 ***
## variety1     -19.725     15.296  -1.290 0.207748    
## variety2      22.475     15.296   1.469 0.152880    
## variety3       8.475     15.296   0.554 0.583924    
## variety4     -67.325     15.296  -4.402 0.000142 ***
## variety5      85.275     15.296   5.575 5.78e-06 ***
## variety6     -23.525     15.296  -1.538 0.135270    
## variety7     -35.725     15.296  -2.336 0.026901 *  
## block1        28.875     11.562   2.497 0.018666 *  
## block2         3.375     11.562   0.292 0.772520    
## block3        29.000     11.562   2.508 0.018208 *  
## block4       -13.125     11.562  -1.135 0.265941    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.56 on 28 degrees of freedom
## Multiple R-squared:  0.7477, Adjusted R-squared:  0.6485 
## F-statistic: 7.542 on 11 and 28 DF,  p-value: 7.723e-06
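
The sum-contrast coding of a three-level factor, say, looks like this; the last level gets a row of \(-1\)'s, so its coefficient is minus the sum of the others:

contr.sum(3)
##   [,1] [,2]
## 1    1    0
## 2    0    1
## 3   -1   -1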

Although the estimated coefficients change, the analysis of variance tables are unchanged. Also unchanged are any estimated differences between varieties (or blocks), as the following check shows.
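
The difference between varieties 2 and 1 can be read off either fit (under the treatment contrasts it is simply the variety2 coefficient):

coef(twoway)["variety2"]                              # 42.2 under treatment contrasts
coef(twoway2)["variety2"] - coef(twoway2)["variety1"] # 22.475 - (-19.725) = 42.2 again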

The estimates under the sum constraints are: \(\hat\mu = \bar y_{..}, \hat\alpha_i = \bar y_{i.}-\bar y_{..}, \hat\beta_j = \bar y_{.j}-\bar y_{..}\):

ybar <- mean(oatvar$yield)
with(oatvar, tapply(yield, variety, mean)) - ybar
##       1       2       3       4       5       6       7       8 
## -19.725  22.475   8.475 -67.325  85.275 -23.525 -35.725  30.075
with(oatvar, tapply(yield, block, mean)) - ybar
##       I      II     III      IV       V 
##  28.875   3.375  29.000 -13.125 -48.125

Note that R lazily doesn’t give you estimates of \(\alpha_8\) and \(\beta_5\), because they can be computed as minus the sum of the other ones.
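
For instance, \(\hat\alpha_8\) can be recovered from the fit:

-sum(coef(twoway2)[2:8]) # the variety coefficients are elements 2 through 8
## [1] 30.075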

These estimates are not of interest by themselves, since they depend on the constraints imposed. It doesn’t make any sense to say “variety 4 is significant”, or “variety 4 is significantly different from 0”. The analysis of variance table tells you there is a significant difference between the varieties, but it doesn’t tell you more than this.

One way to provide information about the individual varieties is to construct confidence intervals for the difference between any pair of varieties, or to test the hypothesis that the expected yield of variety \(i\) equals that of variety \(j\) (\(H_0:\alpha_i = \alpha_j\)). There are many ways to do this, and no clearly best way in all circumstances; the problem is sometimes called "multiple comparisons", and the R package multcomp implements many of the methods. Ch 15.4 in FLM-2 addresses pairwise comparisons.
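
For a single pair chosen in advance, an unadjusted interval is straightforward. As a sketch, using the sum-contrast fit twoway2 from above (5 blocks per variety, 28 residual degrees of freedom):

mse <- summary(twoway2)$sigma^2 # residual mean square
se.diff <- sqrt(2 * mse / 5)    # standard error of a difference of two variety means
d25 <- coef(twoway2)["variety2"] - coef(twoway2)["variety5"]
d25 + c(-1, 1) * qt(0.975, 28) * se.diff # unadjusted 95% CI for alpha_2 - alpha_5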

In the absence of any prior knowledge about the 8 varieties, we might think of comparing every pair of expected values, leading to \({8\choose 2} = 28\) possible confidence intervals or \(p\)-values. There are many ways to adjust either the \(p\)-values or the widths of the confidence intervals for this multiplicity, often described as controlling the "family-wise error rate", where the "family" in question is the set of all the comparisons. Tukey's "honest significant difference" is based on the distribution of \(\max_i \bar y_{i.} - \min_i \bar y_{i.}\) when \(\bar y_{1.}, \dots, \bar y_{T.}\) are assumed to be independent normal random variables and an estimate of \(\text{var}(y_{ij})\) is available. The function TukeyHSD (in the base stats package) takes the output of aov and returns the comparisons. And aov is itself a wrapper for lm that is suitable for balanced experiments; it gives the same fit as lm but has a different summary method (see help(aov)).

options(contrasts=c("contr.sum","contr.poly")) # in class I left off the `s` on contrasts in the LHS
twoway3 <- aov(yield ~ variety + block, data = oatvar)
allpairs <- TukeyHSD(twoway3, "variety")  # gives all 28 confidence intervals, probably too much information!
plot(allpairs)

Because the experiment is completely balanced, the difference between the upper and lower confidence limits is the same for all comparisons. For example

allpairs$variety[1,3] -  allpairs$variety[1,2]
## [1] 151.2471
allpairs$variety[2,3] -  allpairs$variety[2,2]
## [1] 151.2471

and this is nearly consistent with the formula given in FLM-2 on p. 229 (Faraway forgot to take the square root of \(1/J_i + 1/J_j\), in his notation, although in the code just below the formula he does take the square root correctly; the 1st edition has it right in Sec. 14.4). The value computed below is the half-width of an interval, and doubling it gives 151.2469, matching the widths above up to the rounding of the MSE to 1336.9:

qtukey(0.95, 8, 28) * sqrt(1336.9) * sqrt(2/5) / sqrt(2) # q * sqrt(MSE*(1/5 + 1/5)) / sqrt(2)
## [1] 75.62344

A different family-wise correction for \(p\)-values is to declare a comparison significant at level 0.05 only if \(p < 0.05/n_{pairs}\), where \(n_{pairs}\) is the number of comparisons; in our case 28, and \(0.05/28 \approx 0.0018\). This is called a Bonferroni adjustment. In terms of confidence intervals, we would use intervals of full width 2*qt(1 - 0.0018/2, 28)*sqrt(2*MSE/5) (note the two-sided quantile and the standard error of a difference); these come out somewhat wider than the Tukey intervals above, since Bonferroni is conservative for this family.
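
A sketch of that computation in R, using the MSE from the fit above:

mse <- summary(twoway2)$sigma^2 # about 1336.9
alpha.pc <- 0.05 / choose(8, 2) # per-comparison level, 0.05/28
2 * qt(1 - alpha.pc/2, 28) * sqrt(2 * mse / 5) # full width of each Bonferroni interval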