# How Uncertain Are Your Random Forest Predictions?

#### 2017-07-18

This vignette demonstrates how to use the surfin R package to compute uncertainty for random forest predictions. We will compare the U-statistics based variance estimate (Mentch & Hooker 2016) provided in this package with another estimate based on the infinitesimal jackknife (Wager, Hastie, Efron, 2014). The latter is provided in the R code randomForestCI by the authors of the paper. For a discussion of the differences between these two variance estimates, see Section 2.1 of Mentch et al.

Categorical predictors are currently converted to their numeric equivalents, not made into indicator variables. This feature is pending. If you have categorical predictors, for now, make them indicators before calling the forest function.

This package is actively under development. Feedback, bug reports, etc. are very much welcome! Find our contact info on the package website.

## Setup

Let’s get started! First, load packages needed for this example:

library(surfin)
library(devtools)  # to install randomForestCI package from github
library(randomForest)  # to compare forest implementations
library(rpart) # for kyphosis data
#library(MASS) # for Boston housing and breast cancer data

Next, install and load the randomForestCI R package:

install_github("swager/randomForestCI")
## Skipping install of 'randomForestCI' from a github remote, the SHA1 (8b2b84f3) has not changed since last install.
##   Use force = TRUE to force installation
library(randomForestCI)

## Regression

#data(Boston)
#x = Boston[,1:(ncol(Boston)-1)]
#y = Boston[,ncol(Boston)]
x = cu.summary[,c("Price","Country","Reliability","Type")]
y = cu.summary$Mileage keep = !is.na(y) y = y[keep] x = x[keep,] keep = !apply(is.na(x),1,any) y = y[keep] x = x[keep,] n = length(y) train = sample(1:n,n*0.7) test = setdiff(1:n,train) xtrain = x[train,] ytrain = y[train] xtest = x[test,] ytest = y[test] To compute the U-statistics based variance estimate, we use sampling without replacement and specify B, the number of common observations between trees to be 20. To give an idea of what this means, if we use ntree=5000 trees and B=20 common observations between trees, L=5000/20 = 250 trees will share an observation, then the next 250 trees with share another observation, and so forth. So two of these three parameters can be specified, and the third will automatically follow. fit = forest(xtrain,ytrain,var.type="ustat",B=20,ntree=5000) ## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option." ## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option." Check out what the forest object outputs: names(fit) ## [1] "inbag.times" "oob.times" "forest" ## [4] "type" "varNames" "replace" ## [7] "var.type" "B" "sampSize" ## [10] "ntree" "individualTrees" "predicted" ## [13] "predictedAll" "predictedOOB" ### Prediction There are a variety of prediction options to choose from: 1. On the train set using only out-of-bag train observations 2. On the train set using all train observations 3. On the test set using all test observations u_train_oob = fit$predicted        # Case (1)
u_train = predict(fit,xtrain)  # Case (2)
temp = predict(fit,xtest,individualTrees=T)   # Case (3)
u_test = temp$predicted u_test_all = temp$predictedAll
temp = data.frame(u_train_oob,u_train)
head(temp)
##   u_train_oob  u_train
## 1    22.10344 21.90237
## 2    24.16341 24.15650
## 3    25.18435 25.85654
## 4    25.10695 21.05798
## 5    24.11299 23.69542
## 6    24.16176 23.98517
head(u_test)
## [1] 31.31556 30.35831 30.45509 24.10829 26.45608 23.91340

### U-statistics based variance estimate

We can calculate and plot the u-statistics based variance estimate on the test set (Case (2)):

ustat = forest.varU(u_test_all,fit)
head(ustat)
##      y.hat   var.hat
## 1 31.31556 1.3984477
## 2 30.35831 0.5925618
## 3 30.45509 1.5342394
## 4 24.10829 2.6394619
## 5 26.45608 0.7610817
## 6 23.91340 0.2698203
plot(ustat)

It’s also possible to retrieve not just the variance, but also covariance between predictions:

temp = forest.varU(u_test_all,fit,covariance=TRUE)
y.hat = temp[[1]]
cov = temp[[2]]
dim(cov)
## [1] 15 15
cov[1:6,1:6]
##             [,1]         [,2]        [,3]         [,4]        [,5]
## [1,]  1.39844768 -0.160568433  1.01207637 -0.328440364  0.02343384
## [2,] -0.16056843  0.592561777 -0.11385230  0.006699479  0.20282973
## [3,]  1.01207637 -0.113852305  1.53423943 -0.260049369  0.03713565
## [4,] -0.32844036  0.006699479 -0.26004937  2.639461891  0.60159975
## [5,]  0.02343384  0.202829732  0.03713565  0.601599751  0.76108170
## [6,] -0.29651208  0.087548214 -0.13532654 -0.150396050 -0.02450429
##             [,6]
## [1,] -0.29651208
## [2,]  0.08754821
## [3,] -0.13532654
## [4,] -0.15039605
## [5,] -0.02450429
## [6,]  0.26982026

The (i,j) element in the covariance matrix represents the estimated covariance between observation i and observation j’s predictions.

The diagonals of the covariance matrix are exactly the variances we saw above:

unique(diag(cov) - ustat[,2])
## [1] 0

We can also calculate variance estimate on the train set (Case (1)):

ij2 = forest.varIJ(fit$predictedAll,fit) head(ij2) ## y.hat var.hat ## 1 21.81143 0.11325415 ## 2 24.12591 0.08539112 ## 3 25.84118 0.47283131 ## 4 21.01991 5.87878848 ## 5 23.58064 0.18573485 ## 6 23.90902 0.16662527 plot(ij2) ### Compare to randomForest package Let’s compare our forest’s predictions to the predictions of the randomForest package: rf_train_oob = rf$predicted
plot(ij2_train_oob,rf_train_oob)
lines(ij2_train_oob,ij2_train_oob,lty="dashed")

### Sensitivity Analysis

Let’s compare the variance estimates and see how they change when more trees are used in the forest:

varU = vector("numeric")
varIJ = vector("numeric")
nts = seq(1000,10000,2000)
for (nt in nts)
{
fit = forest(xtrain,ytrain,var.type="ustat",B=20,ntree=nt)
varU = c(varU,mean(forest.varU(fit$predictedAll,fit)[,2])) rf = randomForest(xtrain, ytrain, keep.inbag = TRUE, ntree=nt) varIJ = c(varIJ,mean(randomForestInfJack(rf, xtrain, calibrate = TRUE)[,2])) } ## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option." ## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option." ## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option." ## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option." ## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option." ## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option." ## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option." ## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option." ## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option." ## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option." plot(nts,varU,ylim=c(0,max(varU,varIJ)),cex.axis=0.6,ylab="Mean Est. Variance",xlab="Number of Trees",type="o",cex.lab=0.5) points(nts,varIJ,col="blue",type="o") legend("topright",legend=c("U-Stat","IJ"),col=c("black","blue"),lty="solid",cex=0.6) print(varU) ## [1] 2.4101866 0.9823698 0.7984688 0.6536290 0.6363574 print(varIJ) ## [1] 1.137313 1.096166 1.103347 1.107860 1.074623 We can also perform sensitivity analysis on the B parameter, the number of common observations between trees. varU = vector("numeric") varIJ = vector("numeric") bs = c(2,5,10,20,25) for (b in bs) { fit = forest(xtrain,ytrain,var.type="ustat",B=b,ntree=1000) varU = c(varU,mean(forest.varU(fit$predictedAll,fit)[,2]))
}
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
plot(bs,varU,ylim=c(0,max(varU,varIJ)),cex.axis=0.6,ylab="Mean Est. Variance",xlab="B",type="o",cex.lab=0.5)

print(varU)
## [1] 0.2454759 0.8651213 1.1021971 2.3720810 2.9286694

## Binary Classification

Next, we try classification. Currently only binary classification is supported:

#data(biopsy)
#x = biopsy[1:(ncol(biopsy)-1)]
#y = biopsy[,ncol(biopsy)]
x = kyphosis[,c("Age","Number","Start")]
y = kyphosis$Kyphosis keep = !is.na(y) y = y[keep] x = x[keep,] keep = !apply(is.na(x),1,any) y = y[keep] x = x[keep,] n = length(y) train = sample(1:n,n*0.7) test = setdiff(1:n,train) xtrain = x[train,] ytrain = y[train] xtest = x[test,] ytest = y[test] The response is rather imbalanced: table(y) ## y ## absent present ## 64 17 ### Prediction Like for regression, we have several options for prediction. fit = forest(xtrain,ytrain,var.type="ustat",B=20,ntree=5000) ## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option." ## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option." names(fit) ## [1] "inbag.times" "oob.times" "forest" ## [4] "type" "key" "varNames" ## [7] "replace" "var.type" "B" ## [10] "sampSize" "ntree" "individualTrees" ## [13] "predicted" "predictedAll" "predictedProb" ## [16] "predictedOOB" "predictedProbOOB" u_train_oob = fit$predicted        # Case (1)
table(u_train_oob)
## u_train_oob
##  absent present
##      49       7
u_train = predict(fit,xtrain)  # Case (2)
table(u_train)
## u_train
##  absent present
##      47       9
temp = predict(fit,xtest,individualTrees=T)   # Case (3)
u_test = temp$predicted u_test_prob = temp$predictedProb
u_test_all = temp$predictedAll table(u_test) ## u_test ## absent present ## 24 1 ### U-statistics based variance estimate Check out the test set variance estimate (Case (2)): ustat = forest.varU(u_test_all,fit) head(ustat) ## y.hat var.hat ## 1 absent 0.0002777746 ## 2 absent 0.0001122600 ## 3 absent 0.0318183038 ## 4 absent 0.0020098612 ## 5 absent 0.0000000000 ## 6 absent 0.0082595316 plot(ustat) It’s possible to look at the predicted probability of the classes, instead of class names: ustat = forest.varU(u_test_prob,fit) head(ustat) ## y.hat var.hat ## 1 0.9902894 0.0002703610 ## 2 0.9566679 0.0016197762 ## 3 0.6873461 0.0289572081 ## 4 0.9757894 0.0020932842 ## 5 0.9873632 0.0002641357 ## 6 0.8341363 0.0090549576 plot(ustat) Now we look at the training set (Case (1)), both class names and probabilities: ustat = forest.varU(fit$predictedAll,fit)
head(ustat)
##     y.hat      var.hat
## 1  absent 0.0173323135
## 2  absent 0.0023576963
## 3  absent 0.0091375044
## 4 present 0.0206843294
## 5  absent 0.0005899104
## 6  absent 0.0010125877
plot(ustat)

ij2 = forest.varIJ(fit$predictedAll,fit) head(ij2) ## y.hat var.hat ## 1 1.5176 0.0466423657 ## 2 1.9776 0.0014053885 ## 3 1.9184 0.0076155018 ## 4 1.1922 0.0426017450 ## 5 1.9940 0.0009817956 ## 6 1.9808 0.0012334122 plot(ij2) ### Compare to randomForest package We end by comparing our forest’s predictions to that of the randomForest package rf = randomForest(xtrain,ytrain,keep.forest=TRUE,keep.inbag=TRUE,replace=TRUE,ntree=5000) rf_train_oob = rf$predicted
table(ij2_train_oob,rf_train_oob)
##              rf_train_oob
## ij2_train_oob absent present
##       absent      49       0
##       present      2       5

## Implementation Notes

• Data with non-response, or classification with more than 2 categories is not yet supported. Contact us if you are eager for these!

• Categorical predictors are currently converted to their numeric equivalents, not made into indicator variables. This feature is pending.

• Like the randomForest package, the splitting criterion for regression is mean squared error, and gini impurity for binary classification.

## Mathematical Notes

• Minimizing gini impurity in binary classification is equivalent to minimizing mean squared error. For example, taking $$Y\in\{0,1\}$$, it can be shown that MSE = $$\frac{1}{2}$$ Gini impurity.

## Some Intuition (to be fleshed out)

In a foundational paper, Breiman (2001) showed that a random forest’s sampling variance (by which we mean ….) is determined by two factors: the variance of each tree in the forest, and the correlation between trees. Intuitively, having trees that are certain about their predictions and less correlated with each other means the forest is reasonably certain .

One of the parameters driving correlation between individual trees is how many features they share. Hence, the higher the proportion of features sampled by each tree, the higher the probability that trees have features in common. However, this same parameter also affects the variance of an individual tree - the less features a tree is allowed to use, the more uncertain its prediction. Hence, ….