References:

Brieman’s Two Cultures

Data is generated by a black box where a vector of input varaibles \(\bf x\) go in one side and on the other side response variables \(\bf y\) come out.

\[ {\bf y} \leftarrow \text{Nature} \leftarrow {\bf x}\]

Goals of Data Analysis

Prediction: To be able to predict what the responses are going to be to future input variables. Information: To extract some information about how nature is associating the response variables to the input variables.

Brieman (2001) described two cultures in statistical modelling: data modelling and algorithmic modelling.

Data Modelling Culture

Data modelling: data are generated from independent draws from

\[\text{response variables}=f\left(\text{predictor variables, random noise, parameters} \right). \]

Parameters estimated from data and the model is used for information/prediction.
Blackbox is filled in with regression (linear, logistic, Cox).
Model validataion: Goodness of fit tests, and residual examination.

Algorithmic Modelling Culture

This culture considers the inside of the box complex and unknown.
Their approach is to find a function \(f({\bf x})\) —an algorithm that operates on \({\bf x}\) to predict the responses \(\bf y\).
Model validation measured by predictive accuracy.

Using Data Models

Conclusions are about the model’s mechanism, and not about natures mechanism.
Therefore, if the model is a poor model of nature then conclusions may be wrong.
Adding or deleting variables, nonlinear combinations of variables means that goodness-of-fit tests are not applicable.
Residual analysis is typically useful for data with two or three variables.
There are often a “multiplicity” of data models.
Which one most accurately reflects the data is difficult to resolve with goodness of fit tests.

Predictive Accuracy

Drop a case \(\bf x\) into the “nature’s box” and output \(y\).
Drop a case \(\bf x\) into the “model box” and output \(y^{\prime}\).
If \(y\) is close to \(y^{\prime}\) then the model is a good emulation of nature.

Theory in Algorithmic Modelling

… nature produces data in a black box whose insides are complex, mysterious, and,atleast,partly unknowable.

Observe a set of \(\bf x\)’s and \(\bf y\)’s.
Find an algorithm \(f({\bf x})\) such that for a future \(\bf x\), \(f({\bf x})\).
Focus is on predictive accuracy.

Multiplicity of Models

Results can be unstable if data is preturbed slightly (e.g., random sample).
Aggregating over a large set of completing models can reduce nonuniqueness and improve accuracy.

Resampling Methods

Cross-Validation

Estimate \(f\) on the basis of training observations \(\{(x_1,y_1),\ldots,(x_n,y_n)\}\), where \(y_i\) are qualitative.
To quantify accuracy of our estimate \(\hat f\) is the training error rate:

\[\frac{1}{n} \sum_{i=1}^{n} I\left( y_i \ne \hat{y_i}\right),\]

where \(\hat{y_i}\) is the predicted class for the \(i^{th}\) observation using \(\hat f\).

Called the training error rate since it was used to train the classifier.
If the classifier is applied to a set of test observations (i.e., observations not used to build the classifier) then the error rate is called the test error rate.
A good classifier is one for which the test error is smallest.

Example

The hepatitis data set contains survival or nonsurvival of 155 hepatitis patients with 19 covariates.

library(tidyverse)
library(boot)
library(broom)
varnames <- c("DEATH","AGE", "SEX", "STEROID","ANTIVIRALS","FATIGUE",
              "MALAISE","ANOREXIA","LIVERBIG","LIVERFIRM","SPLEENPALP",
              "SPIDERS","ASCITES","VARICES","BILIRUBIN","ALKPHOS","SGOT",
              "ALBUMIN","PROTIME","HISTOLOGY")

hepatitis_data <- read_csv("~/Dropbox/Docs/STA2453H/4002/Sta4002-2017/class5/hepatitis.data.csv", col_names = varnames,col_types = cols(DEATH = col_factor(levels = c("1", "2")), 
                                               AGE = col_integer(), 
                                               SEX = col_factor(levels = c("1","2")), 
                                               STEROID = col_factor(levels = c("1","2")),
                                               ANTIVIRALS = col_factor(levels=c("1","2")),
                                               FATIGUE=col_factor(levels=c("1","2")),
                                               MALAISE=col_factor(levels=c("1","2")),
                                               ANOREXIA=col_factor(levels=c("1","2")),
                                               LIVERBIG=col_factor(levels=c("1","2")),
                                               SPLEENPALP=col_factor(levels=c("1","2")),
                                               SPIDERS=col_factor(levels=c("1","2")),
                                               ASCITES=col_factor(levels=c("1","2")),
                                               VARICES=col_factor(levels=c("1","2")),
                                               HISTOLOGY=col_factor(levels=c("1","2"))))
                                               
# remove missing values
lrdat <- hepatitis_data  %>% filter(complete.cases(.))

Fit a simple logistic regression model using FATIGUE.
Randomly select 10% of the data as a test set. Train the model on 90% of the data.

library(tidyverse)
library(boot)
library(broom)

set.seed(100)
train <- lrdat %>% sample_n(72) # random sample 90% of data to train
test <- lrdat %>% anti_join(train)

# Train the model
lr_train <- glm(DEATH~FATIGUE,data=train,family = binomial)
pred_train <- predict(lr_train, type = "response")
head(pred_train)

##         1         2         3         4         5         6 
## 0.7916667 0.9166667 0.9166667 0.7916667 0.7916667 0.9166667

pred_class_train <- ifelse(pred_train>0.5,2,1)
obs_class_train <- train$DEATH

# training error rate
sum(pred_class_train!=obs_class_train)/length(obs_class_train)

## [1] 0.1666667

# Predict class using training data
pred_test <- predict(lr_train,newdata = test,type = "response")
pred_test

##         1         2         3         4         5         6         7 
## 0.9166667 0.9166667 0.7916667 0.9166667 0.7916667 0.7916667 0.7916667 
##         8 
## 0.9166667

# classify a case as alive if predicted prob > 0.5
pred_class_test <- ifelse(pred_test>0.5,2,1)
pred_class_test

## 1 2 3 4 5 6 7 8 
## 2 2 2 2 2 2 2 2

#observed class in test set
obs_class <- test$DEATH

# test error rate
sum(pred_test!=obs_class)/length(obs_class)

## [1] 1

Ideally we would have a large designated test set available.
A class of methods based on holding out a subset of training observations from fitting process based on randomly dividing observations into two parts: a traing set; and validation set.
Training error rate can be highly variable depending on which observations are included in the training set.
Statistical models tend to perform worse with less observations so test error rate may be overestimated if the model was fit on entire data set.

Leave-One-Out Cross-Validation

LOOCV involves splitting the set of observations into two parts.
Instead of creating two subsets of comparable size, a single observation is used for the test set, and the remaining observations make up the training set.
The LOOCV estimate for the test error rate is:

\[CV_{n}=\frac{1}{n}\sum_{i=1}^n error_i,\] where,

\[ error_i = \left\{ \begin{array}{ll} 1 & \mbox{if } y_i \ne {\hat y_i} \\ 0 & \mbox{if } y_i = {\hat y_i} \end{array} \right. \]

In this case \(\hat y_i\) is the predicted value of \(y_i\) using the model trained using \(y_1,\ldots,y_{i-1},y_{i+1},\ldots y_n\).

Use cv.glm() in boot library to obtain estimate of LOOCV.

library(tidyverse)
library(boot)
library(broom)

lrmod1 <- glm(DEATH~FATIGUE,data=lrdat,family = binomial)

cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
#LOOCV error rate
cv.glm(lrdat,lrmod1,cost,K=nrow(lrdat))$delta

## [1] 0.1625 0.1625

Therefore, our cross-validation estimate for the test error is

\(k\)-Fold Cross-Validation

Repeat steps 1-3 \(k\) times:

Divide the set of observations into \(k\) groups (folds) of approximately equal size.
The first fold is treated as the validation set, and the method is fit on the remaining \(k-1\) folds.
Compute the error on the observations that in the held out fold.

Comments:

LOOCV is a special case of \(k\)-fold CV where \(k=n\).
Less expensive computation compared to LOOCV.
\(k=5\) or \(k=10\) in practice.

library(tidyverse)
library(boot)
library(broom)

# k-fold cross validation

cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
set.seed(17)
cv.error.10=numeric(length=10)
 for (i in 1:10){
glm.fit <- glm(DEATH~FATIGUE,data=lrdat,family = binomial)
cv.error.10[i]=cv.glm(lrdat,glm.fit,cost,K=10)$delta[1]
}

cv.error.10

##  [1] 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625

mean(cv.error.10)

## [1] 0.1625

Classification Trees

Binary tree structured classifiers are constructed by repeated splits of subsets of the set of all covariates.
The construction of a tree involves:

Selection of splits,
When to declare a node terminal or to continue splitting it, and
Assigning each terminal node to a class.

library(tree)
tree.mod <- tree(DEATH~.,data=hepatitis_data)
summary(tree.mod)

## 
## Classification tree:
## tree(formula = DEATH ~ ., data = hepatitis_data)
## Variables actually used in tree construction:
## [1] "PROTIME"   "HISTOLOGY" "BILIRUBIN" "SGOT"     
## Number of terminal nodes:  6 
## Residual mean deviance:  0.3104 = 22.97 / 74 
## Misclassification error rate: 0.075 = 6 / 80

plot(tree.mod)
text(tree.mod,pretty = 0)

tree.mod

## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 80 71.010 2 ( 0.16250 0.83750 )  
##    2) PROTIME < 46.5 22 30.500 2 ( 0.50000 0.50000 )  
##      4) HISTOLOGY: 1 9  6.279 2 ( 0.11111 0.88889 ) *
##      5) HISTOLOGY: 2 13 14.050 1 ( 0.76923 0.23077 )  
##       10) BILIRUBIN < 1.1 6  8.318 2 ( 0.50000 0.50000 ) *
##       11) BILIRUBIN > 1.1 7  0.000 1 ( 1.00000 0.00000 ) *
##    3) PROTIME > 46.5 58 17.400 2 ( 0.03448 0.96552 )  
##      6) HISTOLOGY: 1 38  0.000 2 ( 0.00000 1.00000 ) *
##      7) HISTOLOGY: 2 20 13.000 2 ( 0.10000 0.90000 )  
##       14) SGOT < 50.5 7  8.376 2 ( 0.28571 0.71429 ) *
##       15) SGOT > 50.5 13  0.000 2 ( 0.00000 1.00000 ) *

Bagging and Random Forests

Bagging

Decision trees suffer from high variance: if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different.
A procedure with low variance will yield similar results if applied repeatedly to distinct data sets.
Given \(Z_1,\ldots, Z_n\) each with variance \(\sigma^2\), the variance of the mean \(\bar Z\) is \(\sigma^2/n\).
Averaging reduces variance.
a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.
Calculate \({\hat f}^1(x),\dots,{\hat f}^B(x)\) using \(B\) training sets and average to obtain a low-variance learning model.

\[{\hat f}_{avg}(x)=\frac{1}{B}\sum_{i=1}^B {\hat f}^i(x) \]

Generate \(B\) different bootstrapped training data sets.

Bootstrap

Calculate the standard error of

\[\alpha = \frac{\sigma^2_Y-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}.\]

set.seed(1)
library(boot)
library(tidyverse)
data <- data_frame(X=rnorm(100),Y=rnorm(100))

alpha.fn <- function(data,index){
  X=data$X[index]
  Y=data$Y[index]
  return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
}

alpha.fn(data,1:100)

## [1] 0.5320886

b <- boot(data,alpha.fn,R=1000)
b

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data, statistic = alpha.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.5320886 0.0004017625  0.05318517

head(b$t)

##           [,1]
## [1,] 0.5387775
## [2,] 0.5265271
## [3,] 0.5465012
## [4,] 0.5014439
## [5,] 0.5161249
## [6,] 0.5489647

\(\hat{\alpha}=0.5320886\) and bootstrap estimate for \(SE(\hat \alpha)= 0.05318517\).

Train our method on the \(b\)th bootstrapped training set in order to get \({\hat f}^{*b}(x)\).
Average all the predictions, to obtain

\[{\hat f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^B {\hat f}^{*b}(x) \]

This is called bagging or Bootstrap aggregation.

To apply bagging to trees, construct \(B\) trees using \(B\) bootstrapped training sets, and average the resulting predictions.
Trees are grown deep, and not pruned.
Each individual tree has high variance, but low bias.
Averaging these B trees reduces the variance.
For a given test observation, record the class predicted by each of the \(B\) trees, and take a majority vote: the overall prediction is the most commonly occurring majority class among the \(B\) predictions.
Test error can be estimated using out-of-bag (OOB) error estimation.

Random Forests

Build a number of decision trees on bootstrapped training samples.
When building these decision trees, each time a split in a tree is considered, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors.
A new sample of \(\sqrt{p}\) predictors is chosen at each split.
This avoids many of the trees using a few strong predictors.
Bagging is a special case of Random Forests with \(m=p\).

library(randomForest)
set.seed(1)

# Bagged model
bag.model <- randomForest(DEATH~.,data=train,mtry=19,importance=TRUE)
bag.model

## 
## Call:
##  randomForest(formula = DEATH ~ ., data = train, mtry = 19, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 19
## 
##         OOB estimate of  error rate: 16.67%
## Confusion matrix:
##   1  2 class.error
## 1 5  7  0.58333333
## 2 5 55  0.08333333

# Random Forests model
rf.model <- randomForest(DEATH~.,data=train,mtry=round(sqrt(19)),importance=TRUE)
rf.model

## 
## Call:
##  randomForest(formula = DEATH ~ ., data = train, mtry = round(sqrt(19)),      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 11.11%
## Confusion matrix:
##   1  2 class.error
## 1 6  6  0.50000000
## 2 2 58  0.03333333

Inference versus Prediction