References:

  1. Brieman, L. (2001). Statistical Modeling: The Two Cultures.

  2. James, G., et al. An Introduction to Statistical Learning: with Applications in R.

Brieman’s Two Cultures

\[ {\bf y} \leftarrow \text{Nature} \leftarrow {\bf x}\]

Goals of Data Analysis

Prediction: To be able to predict what the responses are going to be to future input variables. Information: To extract some information about how nature is associating the response variables to the input variables.

Data Modelling Culture

\[\text{response variables}=f\left(\text{predictor variables, random noise, parameters} \right). \]

Algorithmic Modelling Culture

Using Data Models

Predictive Accuracy

Theory in Algorithmic Modelling

… nature produces data in a black box whose insides are complex, mysterious, and,atleast,partly unknowable.

Multiplicity of Models

Resampling Methods

Cross-Validation

  • Estimate \(f\) on the basis of training observations \(\{(x_1,y_1),\ldots,(x_n,y_n)\}\), where \(y_i\) are qualitative.

  • To quantify accuracy of our estimate \(\hat f\) is the training error rate:

\[\frac{1}{n} \sum_{i=1}^{n} I\left( y_i \ne \hat{y_i}\right),\]

where \(\hat{y_i}\) is the predicted class for the \(i^{th}\) observation using \(\hat f\).

  • Called the training error rate since it was used to train the classifier.

  • If the classifier is applied to a set of test observations (i.e., observations not used to build the classifier) then the error rate is called the test error rate.

  • A good classifier is one for which the test error is smallest.

Example

The hepatitis data set contains survival or nonsurvival of 155 hepatitis patients with 19 covariates.

library(tidyverse)
library(boot)
library(broom)
varnames <- c("DEATH","AGE", "SEX", "STEROID","ANTIVIRALS","FATIGUE",
              "MALAISE","ANOREXIA","LIVERBIG","LIVERFIRM","SPLEENPALP",
              "SPIDERS","ASCITES","VARICES","BILIRUBIN","ALKPHOS","SGOT",
              "ALBUMIN","PROTIME","HISTOLOGY")

hepatitis_data <- read_csv("~/Dropbox/Docs/STA2453H/4002/Sta4002-2017/class5/hepatitis.data.csv", col_names = varnames,col_types = cols(DEATH = col_factor(levels = c("1", "2")), 
                                               AGE = col_integer(), 
                                               SEX = col_factor(levels = c("1","2")), 
                                               STEROID = col_factor(levels = c("1","2")),
                                               ANTIVIRALS = col_factor(levels=c("1","2")),
                                               FATIGUE=col_factor(levels=c("1","2")),
                                               MALAISE=col_factor(levels=c("1","2")),
                                               ANOREXIA=col_factor(levels=c("1","2")),
                                               LIVERBIG=col_factor(levels=c("1","2")),
                                               SPLEENPALP=col_factor(levels=c("1","2")),
                                               SPIDERS=col_factor(levels=c("1","2")),
                                               ASCITES=col_factor(levels=c("1","2")),
                                               VARICES=col_factor(levels=c("1","2")),
                                               HISTOLOGY=col_factor(levels=c("1","2"))))
                                               
# remove missing values
lrdat <- hepatitis_data  %>% filter(complete.cases(.))
  • Fit a simple logistic regression model using FATIGUE.
  • Randomly select 10% of the data as a test set. Train the model on 90% of the data.
library(tidyverse)
library(boot)
library(broom)

set.seed(100)
train <- lrdat %>% sample_n(72) # random sample 90% of data to train
test <- lrdat %>% anti_join(train)

# Train the model
lr_train <- glm(DEATH~FATIGUE,data=train,family = binomial)
pred_train <- predict(lr_train, type = "response")
head(pred_train)
##         1         2         3         4         5         6 
## 0.7916667 0.9166667 0.9166667 0.7916667 0.7916667 0.9166667
pred_class_train <- ifelse(pred_train>0.5,2,1)
obs_class_train <- train$DEATH

# training error rate
sum(pred_class_train!=obs_class_train)/length(obs_class_train)
## [1] 0.1666667
# Predict class using training data
pred_test <- predict(lr_train,newdata = test,type = "response")
pred_test
##         1         2         3         4         5         6         7 
## 0.9166667 0.9166667 0.7916667 0.9166667 0.7916667 0.7916667 0.7916667 
##         8 
## 0.9166667
# classify a case as alive if predicted prob > 0.5
pred_class_test <- ifelse(pred_test>0.5,2,1)
pred_class_test
## 1 2 3 4 5 6 7 8 
## 2 2 2 2 2 2 2 2
#observed class in test set
obs_class <- test$DEATH

# test error rate
sum(pred_test!=obs_class)/length(obs_class)
## [1] 1
  • Ideally we would have a large designated test set available.
  • A class of methods based on holding out a subset of training observations from fitting process based on randomly dividing observations into two parts: a traing set; and validation set.
  • Training error rate can be highly variable depending on which observations are included in the training set.
  • Statistical models tend to perform worse with less observations so test error rate may be overestimated if the model was fit on entire data set.

Leave-One-Out Cross-Validation

  • LOOCV involves splitting the set of observations into two parts.
  • Instead of creating two subsets of comparable size, a single observation is used for the test set, and the remaining observations make up the training set.

  • The LOOCV estimate for the test error rate is:

\[CV_{n}=\frac{1}{n}\sum_{i=1}^n error_i,\] where,

\[ error_i = \left\{ \begin{array}{ll} 1 & \mbox{if } y_i \ne {\hat y_i} \\ 0 & \mbox{if } y_i = {\hat y_i} \end{array} \right. \]

In this case \(\hat y_i\) is the predicted value of \(y_i\) using the model trained using \(y_1,\ldots,y_{i-1},y_{i+1},\ldots y_n\).

  • Use cv.glm() in boot library to obtain estimate of LOOCV.
library(tidyverse)
library(boot)
library(broom)

lrmod1 <- glm(DEATH~FATIGUE,data=lrdat,family = binomial)

cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
#LOOCV error rate
cv.glm(lrdat,lrmod1,cost,K=nrow(lrdat))$delta
## [1] 0.1625 0.1625

Therefore, our cross-validation estimate for the test error is

\(k\)-Fold Cross-Validation

Repeat steps 1-3 \(k\) times:

  1. Divide the set of observations into \(k\) groups (folds) of approximately equal size.

  2. The first fold is treated as the validation set, and the method is fit on the remaining \(k-1\) folds.

  3. Compute the error on the observations that in the held out fold.

Comments:

  • LOOCV is a special case of \(k\)-fold CV where \(k=n\).
  • Less expensive computation compared to LOOCV.
  • \(k=5\) or \(k=10\) in practice.
library(tidyverse)
library(boot)
library(broom)

# k-fold cross validation

cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
set.seed(17)
cv.error.10=numeric(length=10)
 for (i in 1:10){
glm.fit <- glm(DEATH~FATIGUE,data=lrdat,family = binomial)
cv.error.10[i]=cv.glm(lrdat,glm.fit,cost,K=10)$delta[1]
}

cv.error.10
##  [1] 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625
mean(cv.error.10)
## [1] 0.1625

Classification Trees

  • Binary tree structured classifiers are constructed by repeated splits of subsets of the set of all covariates.

  • The construction of a tree involves:

  1. Selection of splits,
  2. When to declare a node terminal or to continue splitting it, and
  3. Assigning each terminal node to a class.
library(tree)
tree.mod <- tree(DEATH~.,data=hepatitis_data)
summary(tree.mod)
## 
## Classification tree:
## tree(formula = DEATH ~ ., data = hepatitis_data)
## Variables actually used in tree construction:
## [1] "PROTIME"   "HISTOLOGY" "BILIRUBIN" "SGOT"     
## Number of terminal nodes:  6 
## Residual mean deviance:  0.3104 = 22.97 / 74 
## Misclassification error rate: 0.075 = 6 / 80
plot(tree.mod)
text(tree.mod,pretty = 0)

tree.mod
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 80 71.010 2 ( 0.16250 0.83750 )  
##    2) PROTIME < 46.5 22 30.500 2 ( 0.50000 0.50000 )  
##      4) HISTOLOGY: 1 9  6.279 2 ( 0.11111 0.88889 ) *
##      5) HISTOLOGY: 2 13 14.050 1 ( 0.76923 0.23077 )  
##       10) BILIRUBIN < 1.1 6  8.318 2 ( 0.50000 0.50000 ) *
##       11) BILIRUBIN > 1.1 7  0.000 1 ( 1.00000 0.00000 ) *
##    3) PROTIME > 46.5 58 17.400 2 ( 0.03448 0.96552 )  
##      6) HISTOLOGY: 1 38  0.000 2 ( 0.00000 1.00000 ) *
##      7) HISTOLOGY: 2 20 13.000 2 ( 0.10000 0.90000 )  
##       14) SGOT < 50.5 7  8.376 2 ( 0.28571 0.71429 ) *
##       15) SGOT > 50.5 13  0.000 2 ( 0.00000 1.00000 ) *

Bagging and Random Forests

Bagging

  • Decision trees suffer from high variance: if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different.

  • A procedure with low variance will yield similar results if applied repeatedly to distinct data sets.

  • Given \(Z_1,\ldots, Z_n\) each with variance \(\sigma^2\), the variance of the mean \(\bar Z\) is \(\sigma^2/n\).

  • Averaging reduces variance.

  • a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.

  • Calculate \({\hat f}^1(x),\dots,{\hat f}^B(x)\) using \(B\) training sets and average to obtain a low-variance learning model.

\[{\hat f}_{avg}(x)=\frac{1}{B}\sum_{i=1}^B {\hat f}^i(x) \]

  • Generate \(B\) different bootstrapped training data sets.

Bootstrap

Calculate the standard error of

\[\alpha = \frac{\sigma^2_Y-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}.\]

set.seed(1)
library(boot)
library(tidyverse)
data <- data_frame(X=rnorm(100),Y=rnorm(100))

alpha.fn <- function(data,index){
  X=data$X[index]
  Y=data$Y[index]
  return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
}

alpha.fn(data,1:100)
## [1] 0.5320886
b <- boot(data,alpha.fn,R=1000)
b
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data, statistic = alpha.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.5320886 0.0004017625  0.05318517
head(b$t)
##           [,1]
## [1,] 0.5387775
## [2,] 0.5265271
## [3,] 0.5465012
## [4,] 0.5014439
## [5,] 0.5161249
## [6,] 0.5489647

\(\hat{\alpha}=0.5320886\) and bootstrap estimate for \(SE(\hat \alpha)= 0.05318517\).

  • Train our method on the \(b\)th bootstrapped training set in order to get \({\hat f}^{*b}(x)\).

  • Average all the predictions, to obtain

\[{\hat f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^B {\hat f}^{*b}(x) \]

This is called bagging or Bootstrap aggregation.

  • To apply bagging to trees, construct \(B\) trees using \(B\) bootstrapped training sets, and average the resulting predictions.
  • Trees are grown deep, and not pruned.
  • Each individual tree has high variance, but low bias.
  • Averaging these B trees reduces the variance.
  • For a given test observation, record the class predicted by each of the \(B\) trees, and take a majority vote: the overall prediction is the most commonly occurring majority class among the \(B\) predictions.
  • Test error can be estimated using out-of-bag (OOB) error estimation.

Random Forests

  • Build a number of decision trees on bootstrapped training samples.

  • When building these decision trees, each time a split in a tree is considered, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors.

  • A new sample of \(\sqrt{p}\) predictors is chosen at each split.

  • This avoids many of the trees using a few strong predictors.

  • Bagging is a special case of Random Forests with \(m=p\).

library(randomForest)
set.seed(1)

# Bagged model
bag.model <- randomForest(DEATH~.,data=train,mtry=19,importance=TRUE)
bag.model
## 
## Call:
##  randomForest(formula = DEATH ~ ., data = train, mtry = 19, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 19
## 
##         OOB estimate of  error rate: 16.67%
## Confusion matrix:
##   1  2 class.error
## 1 5  7  0.58333333
## 2 5 55  0.08333333
# Random Forests model
rf.model <- randomForest(DEATH~.,data=train,mtry=round(sqrt(19)),importance=TRUE)
rf.model
## 
## Call:
##  randomForest(formula = DEATH ~ ., data = train, mtry = round(sqrt(19)),      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 11.11%
## Confusion matrix:
##   1  2 class.error
## 1 6  6  0.50000000
## 2 2 58  0.03333333