\[ {\bf y} \leftarrow \text{Nature} \leftarrow {\bf x}\]
Prediction: to be able to predict what the responses are going to be to future input variables.
Information: to extract some information about how nature is associating the response variables to the input variables.
\[\text{response variables}=f\left(\text{predictor variables, random noise, parameters} \right). \]
Parameters estimated from data and the model is used for information/prediction.
The black box is filled in with a regression model (linear, logistic, Cox).
Model validation: goodness-of-fit tests and residual examination.
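As a minimal sketch of this workflow on simulated data (the variables and the assumed linear relationship below are purely illustrative, not part of the hepatitis example that follows):
set.seed(1)
x <- rnorm(200)
y <- 1 + 2*x + rnorm(200)                    # assume nature's black box is linear with noise
fit <- lm(y ~ x)                             # fill the box with a linear regression model
summary(fit)$coefficients                    # estimated parameters (information)
predict(fit, newdata = data.frame(x = 0.5))  # prediction for a future x
plot(fit, which = 1)                         # residuals vs fitted values (model validation)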
The algorithmic modelling culture, by contrast, considers the inside of the box complex and unknown.
Their approach is to find a function \(f({\bf x})\) —an algorithm that operates on \({\bf x}\) to predict the responses \(\bf y\).
Model validation measured by predictive accuracy.
Conclusions are about the model's mechanism, and not about nature's mechanism.
Therefore, if the model is a poor model of nature then conclusions may be wrong.
When the model is arrived at by adding or deleting variables, or by trying nonlinear combinations of variables, goodness-of-fit tests are no longer applicable.
Residual analysis is typically useful for data with two or three variables.
There is often a “multiplicity” of data models; which one most accurately reflects the data is difficult to resolve with goodness-of-fit tests.
… nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable.
Observe a set of \(\bf x\)’s and \(\bf y\)’s.
Find an algorithm \(f({\bf x})\) such that for a future \(\bf x\), \(f({\bf x})\) will be a good predictor of \(\bf y\).
Focus is on predictive accuracy.
Results can be unstable if the data are perturbed slightly (e.g., by taking a different random sample).
Aggregating over a large set of competing models can reduce nonuniqueness and improve accuracy.
Estimate \(f\) on the basis of training observations \(\{(x_1,y_1),\ldots,(x_n,y_n)\}\), where \(y_i\) are qualitative.
One way to quantify the accuracy of our estimate \(\hat f\) is the training error rate:
\[\frac{1}{n} \sum_{i=1}^{n} I\left( y_i \ne \hat{y_i}\right),\]
where \(\hat{y_i}\) is the predicted class for the \(i^{th}\) observation using \(\hat f\).
It is called the training error rate since it is computed on the same data that were used to train the classifier.
If the classifier is applied to a set of test observations (i.e., observations not used to build the classifier) then the error rate is called the test error rate.
A good classifier is one for which the test error is smallest.
The hepatitis data set contains survival or nonsurvival of 155 hepatitis patients with 19 covariates.
library(tidyverse)
library(boot)
library(broom)
varnames <- c("DEATH","AGE", "SEX", "STEROID","ANTIVIRALS","FATIGUE",
"MALAISE","ANOREXIA","LIVERBIG","LIVERFIRM","SPLEENPALP",
"SPIDERS","ASCITES","VARICES","BILIRUBIN","ALKPHOS","SGOT",
"ALBUMIN","PROTIME","HISTOLOGY")
hepatitis_data <- read_csv("~/Dropbox/Docs/STA2453H/4002/Sta4002-2017/class5/hepatitis.data.csv",
col_names = varnames,
col_types = cols(DEATH = col_factor(levels = c("1", "2")),
AGE = col_integer(),
SEX = col_factor(levels = c("1","2")),
STEROID = col_factor(levels = c("1","2")),
ANTIVIRALS = col_factor(levels=c("1","2")),
FATIGUE=col_factor(levels=c("1","2")),
MALAISE=col_factor(levels=c("1","2")),
ANOREXIA=col_factor(levels=c("1","2")),
LIVERBIG=col_factor(levels=c("1","2")),
SPLEENPALP=col_factor(levels=c("1","2")),
SPIDERS=col_factor(levels=c("1","2")),
ASCITES=col_factor(levels=c("1","2")),
VARICES=col_factor(levels=c("1","2")),
HISTOLOGY=col_factor(levels=c("1","2"))))
# remove missing values
lrdat <- hepatitis_data %>% filter(complete.cases(.))
To estimate the test error with a validation-set approach, randomly split the data into training and test sets and fit a logistic regression of DEATH on FATIGUE.
library(tidyverse)
library(boot)
library(broom)
set.seed(100)
train <- lrdat %>% sample_n(72) # random sample 90% of data to train
test <- lrdat %>% anti_join(train)
# Train the model
lr_train <- glm(DEATH~FATIGUE,data=train,family = binomial)
pred_train <- predict(lr_train, type = "response")
head(pred_train)
## 1 2 3 4 5 6
## 0.7916667 0.9166667 0.9166667 0.7916667 0.7916667 0.9166667
pred_class_train <- ifelse(pred_train>0.5,2,1)
obs_class_train <- train$DEATH
# training error rate
sum(pred_class_train!=obs_class_train)/length(obs_class_train)
## [1] 0.1666667
# Predict class using training data
pred_test <- predict(lr_train,newdata = test,type = "response")
pred_test
## 1 2 3 4 5 6 7
## 0.9166667 0.9166667 0.7916667 0.9166667 0.7916667 0.7916667 0.7916667
## 8
## 0.9166667
# classify a case as alive if predicted prob > 0.5
pred_class_test <- ifelse(pred_test>0.5,2,1)
pred_class_test
## 1 2 3 4 5 6 7 8
## 2 2 2 2 2 2 2 2
#observed class in test set
obs_class <- test$DEATH
# test error rate
sum(pred_class_test!=obs_class)/length(obs_class)
Instead of creating two subsets of comparable size, a single observation is used for the test set, and the remaining observations make up the training set.
The LOOCV estimate for the test error rate is:
\[CV_{n}=\frac{1}{n}\sum_{i=1}^n error_i,\] where,
\[ error_i = \left\{ \begin{array}{ll} 1 & \mbox{if } y_i \ne {\hat y_i} \\ 0 & \mbox{if } y_i = {\hat y_i} \end{array} \right. \]
In this case \(\hat y_i\) is the predicted value of \(y_i\) from the model trained on all the observations except the \(i\)th, i.e., \((x_1,y_1),\ldots,(x_{i-1},y_{i-1}),(x_{i+1},y_{i+1}),\ldots,(x_n,y_n)\).
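To make the formula concrete, here is a minimal hand-rolled LOOCV loop for the logistic regression of DEATH on FATIGUE (it assumes the lrdat data frame built above; errs and the 0.5 cutoff are our own choices):
n <- nrow(lrdat)
errs <- numeric(n)
for (i in 1:n){
  # leave out observation i, train on the remaining n-1 observations
  fit_i <- glm(DEATH ~ FATIGUE, data = lrdat[-i, ], family = binomial)
  p_i <- predict(fit_i, newdata = lrdat[i, ], type = "response")
  yhat_i <- ifelse(p_i > 0.5, 2, 1)          # predicted class for observation i
  errs[i] <- as.numeric(yhat_i != lrdat$DEATH[i])
}
mean(errs)                                   # LOOCV estimate of the test error rate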
The function cv.glm() in the boot library can be used to obtain the LOOCV estimate.
library(tidyverse)
library(boot)
library(broom)
lrmod1 <- glm(DEATH~FATIGUE,data=lrdat,family = binomial)
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5) # misclassification rate: r = observed 0/1 response, pi = predicted probability
#LOOCV error rate
cv.glm(lrdat,lrmod1,cost,K=nrow(lrdat))$delta
## [1] 0.1625 0.1625
Therefore, our cross-validation estimate for the test error is 0.1625.
Divide the set of observations into \(k\) groups (folds) of approximately equal size.
The first fold is treated as the validation set, and the method is fit on the remaining \(k-1\) folds.
Compute the error on the observations in the held-out fold.
Repeat the last two steps \(k\) times, each time treating a different fold as the validation set, and average the resulting \(k\) error rates; a hand-rolled sketch is given below.
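A minimal hand-rolled version of this procedure for the logistic regression of DEATH on FATIGUE (assuming the lrdat data frame built above; the fold labels and helper names are our own):
set.seed(17)
k <- 10
n <- nrow(lrdat)
fold <- sample(rep(1:k, length.out = n))     # random fold assignment
fold_err <- numeric(k)
for (j in 1:k){
  # fit on all folds except j, then predict the held-out fold
  fit_j <- glm(DEATH ~ FATIGUE, data = lrdat[fold != j, ], family = binomial)
  p_j <- predict(fit_j, newdata = lrdat[fold == j, ], type = "response")
  yhat_j <- ifelse(p_j > 0.5, 2, 1)
  fold_err[j] <- mean(yhat_j != lrdat$DEATH[fold == j])
}
mean(fold_err)                               # k-fold CV estimate of the test error rate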
The cv.glm() function with K = 10 computes the 10-fold CV estimate directly; below the calculation is repeated ten times to check its stability.
library(tidyverse)
library(boot)
library(broom)
# k-fold cross validation
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
set.seed(17)
cv.error.10 <- numeric(length=10)
for (i in 1:10){
  glm.fit <- glm(DEATH~FATIGUE,data=lrdat,family = binomial)
  cv.error.10[i] <- cv.glm(lrdat,glm.fit,cost,K=10)$delta[1]
}
cv.error.10
## [1] 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625 0.1625
mean(cv.error.10)
## [1] 0.1625
Binary tree-structured classifiers are constructed by repeated splits of subsets of the covariate space into two descendant subsets, beginning with the full space.
The construction of a tree involves: the selection of the splits, the decision of when to declare a node terminal or to continue splitting, and the assignment of each terminal node to a class.
library(tree)
tree.mod <- tree(DEATH~.,data=hepatitis_data)
summary(tree.mod)
##
## Classification tree:
## tree(formula = DEATH ~ ., data = hepatitis_data)
## Variables actually used in tree construction:
## [1] "PROTIME" "HISTOLOGY" "BILIRUBIN" "SGOT"
## Number of terminal nodes: 6
## Residual mean deviance: 0.3104 = 22.97 / 74
## Misclassification error rate: 0.075 = 6 / 80
plot(tree.mod)
text(tree.mod,pretty = 0)
tree.mod
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 80 71.010 2 ( 0.16250 0.83750 )
## 2) PROTIME < 46.5 22 30.500 2 ( 0.50000 0.50000 )
## 4) HISTOLOGY: 1 9 6.279 2 ( 0.11111 0.88889 ) *
## 5) HISTOLOGY: 2 13 14.050 1 ( 0.76923 0.23077 )
## 10) BILIRUBIN < 1.1 6 8.318 2 ( 0.50000 0.50000 ) *
## 11) BILIRUBIN > 1.1 7 0.000 1 ( 1.00000 0.00000 ) *
## 3) PROTIME > 46.5 58 17.400 2 ( 0.03448 0.96552 )
## 6) HISTOLOGY: 1 38 0.000 2 ( 0.00000 1.00000 ) *
## 7) HISTOLOGY: 2 20 13.000 2 ( 0.10000 0.90000 )
## 14) SGOT < 50.5 7 8.376 2 ( 0.28571 0.71429 ) *
## 15) SGOT > 50.5 13 0.000 2 ( 0.00000 1.00000 ) *
Decision trees suffer from high variance: if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different.
A procedure with low variance will yield similar results if applied repeatedly to distinct data sets.
Given \(Z_1,\ldots, Z_n\) each with variance \(\sigma^2\), the variance of the mean \(\bar Z\) is \(\sigma^2/n\).
Averaging reduces variance.
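A quick simulation of this fact (the sample sizes are arbitrary illustrative choices):
set.seed(1)
sd(replicate(5000, rnorm(1)))                # sd of a single draw: roughly 1
sd(replicate(5000, mean(rnorm(25))))         # sd of a mean of 25 draws: roughly 1/sqrt(25) = 0.2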
A natural way to reduce the variance, and hence increase the prediction accuracy, of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.
Calculate \({\hat f}^1(x),\dots,{\hat f}^B(x)\) using \(B\) training sets and average to obtain a low-variance learning model.
\[{\hat f}_{avg}(x)=\frac{1}{B}\sum_{i=1}^B {\hat f}^i(x) \]
As an example of the bootstrap, suppose we want to calculate the standard error of
\[\alpha = \frac{\sigma^2_Y-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}.\]
set.seed(1)
library(boot)
library(tidyverse)
data <- data_frame(X=rnorm(100),Y=rnorm(100))
alpha.fn <- function(data,index){
  X <- data$X[index]
  Y <- data$Y[index]
  return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
}
alpha.fn(data,1:100)
## [1] 0.5320886
b <- boot(data,alpha.fn,R=1000)
b
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = alpha.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.5320886 0.0004017625 0.05318517
head(b$t)
## [,1]
## [1,] 0.5387775
## [2,] 0.5265271
## [3,] 0.5465012
## [4,] 0.5014439
## [5,] 0.5161249
## [6,] 0.5489647
\(\hat{\alpha}=0.5320886\) and bootstrap estimate for \(SE(\hat \alpha)= 0.05318517\).
Generate \(B\) bootstrapped training sets, and train our method on the \(b\)th bootstrapped training set to get \({\hat f}^{*b}(x)\).
Average all the predictions, to obtain
\[{\hat f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^B {\hat f}^{*b}(x) \]
This is called bagging or Bootstrap aggregation.
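As a rough illustration (not the randomForest implementation used below), here is a hand-rolled bagging sketch that grows tree() classifiers on bootstrap resamples of the training set created earlier and combines them by majority vote; B, votes and bag_pred are our own helper names:
library(tree)
set.seed(1)
B <- 100
votes <- matrix(NA_character_, nrow = nrow(test), ncol = B)
for (b in 1:B){
  boot_idx <- sample(nrow(train), replace = TRUE)         # bootstrap resample of the training set
  tree_b <- tree(DEATH ~ ., data = train[boot_idx, ])     # grow a classification tree on the resample
  votes[, b] <- as.character(predict(tree_b, newdata = test, type = "class"))
}
bag_pred <- apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote over the B trees
mean(bag_pred != as.character(test$DEATH))                # test error of the bagged classifier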
Random forests also build a number of decision trees on bootstrapped training samples.
When building these decision trees, each time a split in a tree is considered, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(p\) predictors.
A fresh sample of \(m \approx \sqrt{p}\) predictors is chosen at each split.
This prevents the trees from all relying on the same few strong predictors, which would make their predictions highly correlated.
Bagging is a special case of Random Forests with \(m=p\).
library(randomForest)
set.seed(1)
# Bagged model
bag.model <- randomForest(DEATH~.,data=train,mtry=19,importance=TRUE)
bag.model
##
## Call:
## randomForest(formula = DEATH ~ ., data = train, mtry = 19, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 19
##
## OOB estimate of error rate: 16.67%
## Confusion matrix:
## 1 2 class.error
## 1 5 7 0.58333333
## 2 5 55 0.08333333
# Random Forests model
rf.model <- randomForest(DEATH~.,data=train,mtry=round(sqrt(19)),importance=TRUE)
rf.model
##
## Call:
## randomForest(formula = DEATH ~ ., data = train, mtry = round(sqrt(19)), importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 11.11%
## Confusion matrix:
## 1 2 class.error
## 1 6 6 0.50000000
## 2 2 58 0.03333333