How Uncertain Are Your Random Forest Predictions?

Sarah Tan

2017-07-18

This vignette demonstrates how to use the surfin R package to compute uncertainty estimates for random forest predictions. We compare the U-statistics based variance estimate (Mentch & Hooker, 2016) provided in this package with an estimate based on the infinitesimal jackknife (Wager, Hastie & Efron, 2014); the latter is implemented in the randomForestCI R package by the authors of that paper. For a discussion of the differences between these two variance estimates, see Section 2.1 of Mentch & Hooker (2016).

Package Updates

Categorical predictors are currently converted to their numeric equivalents rather than expanded into indicator variables; indicator support is pending. If you have categorical predictors, for now convert them to indicators before calling the forest function.
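
For example, here is a minimal sketch of that conversion using base R's model.matrix (the toy data frame below is ours, purely for illustration):

# Hypothetical data frame with one numeric and one categorical predictor
xdemo = data.frame(price = c(10, 20, 15),
                   country = factor(c("USA", "Japan", "USA")))
# model.matrix expands factors into indicator (dummy) columns;
# "~ . - 1" drops the intercept so every factor level gets its own column
xind = model.matrix(~ . - 1, data = xdemo)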

This package is actively under development. Feedback, bug reports, etc. are very much welcome! Find our contact info on the package website.

Setup

Let’s get started! First, load the packages needed for this example:

library(surfin)
library(devtools)  # to install randomForestCI package from github
library(randomForest)  # to compare forest implementations
library(rpart) # for cu.summary and kyphosis data
#library(MASS) # for Boston housing and breast cancer data

Next, install and load the randomForestCI R package:

install_github("swager/randomForestCI")
## Skipping install of 'randomForestCI' from a github remote, the SHA1 (8b2b84f3) has not changed since last install.
##   Use `force = TRUE` to force installation
library(randomForestCI)

Regression

We start with a regression example:

#data(Boston)
#x = Boston[,1:(ncol(Boston)-1)]
#y = Boston[,ncol(Boston)]
x = cu.summary[,c("Price","Country","Reliability","Type")]
y = cu.summary$Mileage
keep = !is.na(y)
y = y[keep]
x = x[keep,]
keep = !apply(is.na(x),1,any)
y = y[keep]
x = x[keep,]
n = length(y)
train = sample(1:n,n*0.7)
test = setdiff(1:n,train)
xtrain = x[train,]
ytrain = y[train]
xtest = x[test,]
ytest = y[test]

To compute the U-statistics based variance estimate, we sample without replacement and set B, the number of common observations shared between trees, to 20. To give an idea of what this means: with ntree=5000 trees and B=20 common observations, L=5000/20 = 250 trees will share one observation, the next 250 trees will share another observation, and so forth. Since ntree = B x L, any two of these three parameters can be specified, and the third follows automatically.
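
To make the arithmetic concrete:

# The three quantities satisfy ntree = B * L
ntree = 5000
B = 20
L = ntree / B   # 250 trees share each common observation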

fit = forest(xtrain,ytrain,var.type="ustat",B=20,ntree=5000)
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."

Check out what the forest function returns:

names(fit)
##  [1] "inbag.times"     "oob.times"       "forest"         
##  [4] "type"            "varNames"        "replace"        
##  [7] "var.type"        "B"               "sampSize"       
## [10] "ntree"           "individualTrees" "predicted"      
## [13] "predictedAll"    "predictedOOB"

Prediction

There are a variety of prediction options to choose from:

  1. On the train set using only out-of-bag train observations

  2. On the train set using all train observations

  3. On the test set using all test observations

u_train_oob = fit$predicted        # Case (1)
u_train = predict(fit,xtrain)  # Case (2)
temp = predict(fit,xtest,individualTrees=T)   # Case (3)
u_test = temp$predicted
u_test_all = temp$predictedAll
temp = data.frame(u_train_oob,u_train)
head(temp)
##   u_train_oob  u_train
## 1    22.10344 21.90237
## 2    24.16341 24.15650
## 3    25.18435 25.85654
## 4    25.10695 21.05798
## 5    24.11299 23.69542
## 6    24.16176 23.98517
head(u_test)
## [1] 31.31556 30.35831 30.45509 24.10829 26.45608 23.91340
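
Since we held out a test set, we can also gauge accuracy, for example via the test-set root mean squared error:

# RMSE of the forest's test-set predictions
sqrt(mean((u_test - ytest)^2))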

U-statistics based variance estimate

We can calculate and plot the U-statistics based variance estimate on the test set (Case (3)):

ustat = forest.varU(u_test_all,fit)
head(ustat)
##      y.hat   var.hat
## 1 31.31556 1.3984477
## 2 30.35831 0.5925618
## 3 30.45509 1.5342394
## 4 24.10829 2.6394619
## 5 26.45608 0.7610817
## 6 23.91340 0.2698203
plot(ustat)
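
These variance estimates can be turned into approximate pointwise confidence intervals if we are willing to assume the predictions are roughly Gaussian (a sketch of ours, not a package feature):

# Approximate 95% pointwise intervals: y.hat +/- 1.96 * sqrt(var.hat)
lower = ustat[,1] - 1.96 * sqrt(ustat[,2])
upper = ustat[,1] + 1.96 * sqrt(ustat[,2])
head(cbind(lower, upper))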

It’s also possible to retrieve not just the variances but also the covariances between predictions:

temp = forest.varU(u_test_all,fit,covariance=TRUE)
y.hat = temp[[1]]
cov = temp[[2]]
dim(cov)
## [1] 15 15
cov[1:6,1:6]
##             [,1]         [,2]        [,3]         [,4]        [,5]
## [1,]  1.39844768 -0.160568433  1.01207637 -0.328440364  0.02343384
## [2,] -0.16056843  0.592561777 -0.11385230  0.006699479  0.20282973
## [3,]  1.01207637 -0.113852305  1.53423943 -0.260049369  0.03713565
## [4,] -0.32844036  0.006699479 -0.26004937  2.639461891  0.60159975
## [5,]  0.02343384  0.202829732  0.03713565  0.601599751  0.76108170
## [6,] -0.29651208  0.087548214 -0.13532654 -0.150396050 -0.02450429
##             [,6]
## [1,] -0.29651208
## [2,]  0.08754821
## [3,] -0.13532654
## [4,] -0.15039605
## [5,] -0.02450429
## [6,]  0.26982026

The (i,j) element of the covariance matrix is the estimated covariance between the predictions for observations i and j.
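
One use of the full covariance matrix is estimating the variance of aggregate quantities. For example, by the usual variance-of-a-sum formula, the variance of the average prediction over the m test points is the sum of all covariance entries divided by m^2:

# Var(mean of m predictions) = sum(cov) / m^2
m = nrow(cov)
sum(cov) / m^2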

The diagonal entries of the covariance matrix are exactly the variances we saw above:

unique(diag(cov) - ustat[,2])
## [1] 0

We can also calculate the variance estimate on the train set, using all train observations (Case (2)):

ustat = forest.varU(fit$predictedAll,fit)
head(ustat)
##      y.hat   var.hat
## 1 21.90237 0.1105814
## 2 24.15650 0.2026405
## 3 25.85654 0.7319464
## 4 21.05798 5.2741773
## 5 23.69542 0.1236865
## 6 23.98517 0.1528721
plot(ustat)

Infinitesimal jackknife based variance estimate

Now we compare to the infinitesimal jackknife based variance using the randomForestCI package:

rf = randomForest(xtrain, ytrain, keep.inbag = TRUE, ntree=5000) 
ij = randomForestInfJack(rf, xtrain, calibrate = TRUE)
head(ij)
##                               y.hat    var.hat
## Oldsmobile Cutlass Ciera 4 21.64513 0.08579727
## Mazda 626 4                24.31732 0.15047229
## Honda Prelude Si 4WS 4     25.14416 0.93355101
## Dodge Grand Caravan V6     20.80192 2.07210200
## Buick Skylark 4            23.87048 0.28998848
## Ford Tempo 4               24.16110 0.22380466
plot(ij)

Next we compute the infinitesimal jackknife variance estimate on our own forest; with var.type="infjack", sampling is performed with replacement:

fit = forest(xtrain,ytrain,var.type="infjack",ntree=5000)
ij2_train_oob = fit$predicted   # Case (1)
ij2 = forest.varIJ(fit$predictedAll,fit)
head(ij2)
##      y.hat    var.hat
## 1 21.81143 0.11325415
## 2 24.12591 0.08539112
## 3 25.84118 0.47283131
## 4 21.01991 5.87878848
## 5 23.58064 0.18573485
## 6 23.90902 0.16662527
plot(ij2)

Compare to randomForest package

Let’s compare our forest’s predictions to the predictions of the randomForest package:

rf_train_oob = rf$predicted
plot(ij2_train_oob,rf_train_oob)
lines(ij2_train_oob,ij2_train_oob,lty="dashed")
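
The correlation between the two sets of out-of-bag predictions summarizes their agreement in a single number:

# Agreement between the two implementations' OOB predictions
cor(ij2_train_oob, rf_train_oob)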

Sensitivity Analysis

Let’s compare the two variance estimates and see how they change as more trees are used in the forest:

varU = vector("numeric")
varIJ = vector("numeric")
nts = seq(1000,10000,2000)
for (nt in nts)
{
  fit = forest(xtrain,ytrain,var.type="ustat",B=20,ntree=nt)
  varU = c(varU,mean(forest.varU(fit$predictedAll,fit)[,2]))
  rf = randomForest(xtrain, ytrain, keep.inbag = TRUE, ntree=nt) 
  varIJ = c(varIJ,mean(randomForestInfJack(rf, xtrain, calibrate = TRUE)[,2]))
}
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
plot(nts,varU,ylim=c(0,max(varU,varIJ)),cex.axis=0.6,ylab="Mean Est. Variance",xlab="Number of Trees",type="o",cex.lab=0.5)
points(nts,varIJ,col="blue",type="o")
legend("topright",legend=c("U-Stat","IJ"),col=c("black","blue"),lty="solid",cex=0.6)

print(varU)
## [1] 2.4101866 0.9823698 0.7984688 0.6536290 0.6363574
print(varIJ)
## [1] 1.137313 1.096166 1.103347 1.107860 1.074623

We can also perform sensitivity analysis on the B parameter, the number of common observations between trees.

varU = vector("numeric")
bs = c(2,5,10,20,25)
for (b in bs)
{
  fit = forest(xtrain,ytrain,var.type="ustat",B=b,ntree=1000)
  varU = c(varU,mean(forest.varU(fit$predictedAll,fit)[,2]))
}
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
plot(bs,varU,ylim=c(0,max(varU)),cex.axis=0.6,ylab="Mean Est. Variance",xlab="B",type="o",cex.lab=0.5)

print(varU)
## [1] 0.2454759 0.8651213 1.1021971 2.3720810 2.9286694

Binary Classification

Next, we try classification. Currently only binary classification is supported:

#data(biopsy)
#x = biopsy[1:(ncol(biopsy)-1)]
#y = biopsy[,ncol(biopsy)]
x = kyphosis[,c("Age","Number","Start")]
y = kyphosis$Kyphosis
keep = !is.na(y)
y = y[keep]
x = x[keep,]
keep = !apply(is.na(x),1,any)
y = y[keep]
x = x[keep,]
n = length(y)
train = sample(1:n,n*0.7)
test = setdiff(1:n,train)
xtrain = x[train,]
ytrain = y[train]
xtest = x[test,]
ytest = y[test]

The response is rather imbalanced:

table(y)
## y
##  absent present 
##      64      17

Prediction

As with regression, we have several options for prediction.

fit = forest(xtrain,ytrain,var.type="ustat",B=20,ntree=5000)
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
names(fit)
##  [1] "inbag.times"      "oob.times"        "forest"          
##  [4] "type"             "key"              "varNames"        
##  [7] "replace"          "var.type"         "B"               
## [10] "sampSize"         "ntree"            "individualTrees" 
## [13] "predicted"        "predictedAll"     "predictedProb"   
## [16] "predictedOOB"     "predictedProbOOB"
u_train_oob = fit$predicted        # Case (1)
table(u_train_oob)
## u_train_oob
##  absent present 
##      49       7
u_train = predict(fit,xtrain)  # Case (2)
table(u_train)
## u_train
##  absent present 
##      47       9
temp = predict(fit,xtest,individualTrees=T)   # Case (3)
u_test = temp$predicted
u_test_prob = temp$predictedProb
u_test_all = temp$predictedAll
table(u_test)
## u_test
##  absent present 
##      24       1
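
Since we kept the true test labels, we can also cross-tabulate predictions against them:

# Confusion table of test-set predictions against the true labels
table(predicted = u_test, actual = ytest)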

U-statistics based variance estimate

Check out the test set variance estimate (Case (3)):

ustat = forest.varU(u_test_all,fit)
head(ustat)
##    y.hat      var.hat
## 1 absent 0.0002777746
## 2 absent 0.0001122600
## 3 absent 0.0318183038
## 4 absent 0.0020098612
## 5 absent 0.0000000000
## 6 absent 0.0082595316
plot(ustat)

It’s also possible to look at the predicted probabilities of the classes instead of the class names:

ustat = forest.varU(u_test_prob,fit)
head(ustat)
##       y.hat      var.hat
## 1 0.9902894 0.0002703610
## 2 0.9566679 0.0016197762
## 3 0.6873461 0.0289572081
## 4 0.9757894 0.0020932842
## 5 0.9873632 0.0002641357
## 6 0.8341363 0.0090549576
plot(ustat)
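
As in the regression case, these variances can be turned into rough pointwise intervals; for probabilities it makes sense to clip them to [0, 1] (again a normality-based sketch of ours, not a package feature):

# Approximate 95% intervals for predicted probabilities, clipped to [0, 1]
lower = pmax(0, ustat[,1] - 1.96 * sqrt(ustat[,2]))
upper = pmin(1, ustat[,1] + 1.96 * sqrt(ustat[,2]))
head(cbind(lower, upper))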

Now we look at the train set using all train observations (Case (2)), with both class names and probabilities:

ustat = forest.varU(fit$predictedAll,fit)
head(ustat)
##     y.hat      var.hat
## 1  absent 0.0173323135
## 2  absent 0.0023576963
## 3  absent 0.0091375044
## 4 present 0.0206843294
## 5  absent 0.0005899104
## 6  absent 0.0010125877
plot(ustat)

ustat = forest.varU(fit$predictedProb,fit)
head(ustat)
##       y.hat     var.hat
## 1 0.4651144 0.013532360
## 2 0.9375014 0.003645587
## 3 0.9085075 0.007554338
## 4 0.1901967 0.018635951
## 5 0.9810120 0.001036149
## 6 0.9476821 0.001819176
plot(ustat)

Infinitesimal jackknife based variance estimate

Again we could compare to the infinitesimal jackknife, but there appears to be a bug in the randomForestCI code (infinitesimalJackknife.R, lines 144-145) affecting variance estimates for classification, so the code below is commented out until that package is fixed:

#rf = randomForest(x, y, keep.inbag = TRUE)
#ij = randomForestInfJack(rf, x, calibrate = TRUE)
#head(ij)
#plot(ij)

Again we compute the infinitesimal jackknife estimate on our own forest, built with sampling with replacement:

fit = forest(xtrain,ytrain,var.type="infjack",ntree=5000)
ij2_train_oob = fit$predicted   # Case (1)
ij2 = forest.varIJ(fit$predictedAll,fit)
head(ij2)
##    y.hat      var.hat
## 1 1.5176 0.0466423657
## 2 1.9776 0.0014053885
## 3 1.9184 0.0076155018
## 4 1.1922 0.0426017450
## 5 1.9940 0.0009817956
## 6 1.9808 0.0012334122
plot(ij2)

Compare to randomForest package

We end by comparing our forest’s predictions to those of the randomForest package:

rf = randomForest(xtrain,ytrain,keep.forest=TRUE,keep.inbag=TRUE,replace=TRUE,ntree=5000)
rf_train_oob = rf$predicted
table(ij2_train_oob,rf_train_oob)
##              rf_train_oob
## ij2_train_oob absent present
##       absent      49       0
##       present      2       5
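
From this table, the overall agreement rate between the two forests is:

# Proportion of train observations where the two OOB predictions agree
mean(ij2_train_oob == rf_train_oob)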

Implementation Notes

Mathematical Notes

Some Intuition (to be fleshed out)

In a foundational paper, Breiman (2001) showed that a random forest’s sampling variance (by which we mean ….) is determined by two factors: the variance of each tree in the forest and the correlation between trees. Intuitively, if the individual trees are certain about their predictions and not too correlated with each other, the forest as a whole is reasonably certain of its predictions.
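
A classical correlated-averages calculation (standard for bagged ensembles generally, not specific to this package) makes this precise: if each tree's prediction has variance sigma^2 and the pairwise correlation between trees is rho, the average of ntree trees has variance rho * sigma^2 + (1 - rho) * sigma^2 / ntree, which tends to rho * sigma^2 as the forest grows:

# Variance of the average of ntree correlated, identically distributed trees
forest_var = function(sigma2, rho, ntree) {
  rho * sigma2 + (1 - rho) * sigma2 / ntree
}
forest_var(sigma2 = 1, rho = 0.3, ntree = 5000)  # close to rho * sigma2 = 0.3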

One of the parameters driving the correlation between individual trees is how many features they share: the higher the proportion of features sampled by each tree, the higher the probability that trees have features in common. However, this same parameter also affects the variance of an individual tree - the fewer features a tree is allowed to use, the more uncertain its prediction. Hence, ….