This vignette demonstrates how to use the surfin R package to compute uncertainty for random forest predictions. We will compare the U-statistics based variance estimate (Mentch & Hooker 2016) provided in this package with another estimate based on the infinitesimal jackknife (Wager, Hastie, Efron, 2014). The latter is provided in the R code randomForestCI by the authors of the paper. For a discussion of the differences between these two variance estimates, see Section 2.1 of Mentch et al.

Categorical predictors are currently converted to their numeric equivalents, not made into indicator variables. This feature is pending. If you have categorical predictors, for now, make them indicators before calling the forest function.

This package is actively under development. Feedback, bug reports, etc. are very much welcome! Find our contact info on the package website.

Let’s get started! First, load packages needed for this example:

```
library(surfin)
library(devtools) # to install randomForestCI package from github
library(randomForest) # to compare forest implementations
library(rpart) # for kyphosis data
#library(MASS) # for Boston housing and breast cancer data
```

Next, install and load the randomForestCI R package:

`install_github("swager/randomForestCI")`

```
## Skipping install of 'randomForestCI' from a github remote, the SHA1 (8b2b84f3) has not changed since last install.
## Use `force = TRUE` to force installation
```

`library(randomForestCI)`

We start with a regression example:

```
#data(Boston)
#x = Boston[,1:(ncol(Boston)-1)]
#y = Boston[,ncol(Boston)]
x = cu.summary[,c("Price","Country","Reliability","Type")]
y = cu.summary$Mileage
keep = !is.na(y)
y = y[keep]
x = x[keep,]
keep = !apply(is.na(x),1,any)
y = y[keep]
x = x[keep,]
n = length(y)
train = sample(1:n,n*0.7)
test = setdiff(1:n,train)
xtrain = x[train,]
ytrain = y[train]
xtest = x[test,]
ytest = y[test]
```

To compute the U-statistics based variance estimate, we use sampling without replacement and specify B, the number of common observations between trees to be 20. To give an idea of what this means, if we use ntree=5000 trees and B=20 common observations between trees, L=5000/20 = 250 trees will share an observation, then the next 250 trees with share another observation, and so forth. So two of these three parameters can be specified, and the third will automatically follow.

`fit = forest(xtrain,ytrain,var.type="ustat",B=20,ntree=5000)`

```
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
```

Check out what the forest object outputs:

`names(fit)`

```
## [1] "inbag.times" "oob.times" "forest"
## [4] "type" "varNames" "replace"
## [7] "var.type" "B" "sampSize"
## [10] "ntree" "individualTrees" "predicted"
## [13] "predictedAll" "predictedOOB"
```

There are a variety of prediction options to choose from:

On the train set using only out-of-bag train observations

On the train set using all train observations

On the test set using all test observations

```
u_train_oob = fit$predicted # Case (1)
u_train = predict(fit,xtrain) # Case (2)
temp = predict(fit,xtest,individualTrees=T) # Case (3)
u_test = temp$predicted
u_test_all = temp$predictedAll
temp = data.frame(u_train_oob,u_train)
head(temp)
```

```
## u_train_oob u_train
## 1 22.10344 21.90237
## 2 24.16341 24.15650
## 3 25.18435 25.85654
## 4 25.10695 21.05798
## 5 24.11299 23.69542
## 6 24.16176 23.98517
```

`head(u_test)`

`## [1] 31.31556 30.35831 30.45509 24.10829 26.45608 23.91340`

We can calculate and plot the u-statistics based variance estimate on the test set (Case (2)):

```
ustat = forest.varU(u_test_all,fit)
head(ustat)
```

```
## y.hat var.hat
## 1 31.31556 1.3984477
## 2 30.35831 0.5925618
## 3 30.45509 1.5342394
## 4 24.10829 2.6394619
## 5 26.45608 0.7610817
## 6 23.91340 0.2698203
```

`plot(ustat)`

It’s also possible to retrieve not just the variance, but also covariance between predictions:

```
temp = forest.varU(u_test_all,fit,covariance=TRUE)
y.hat = temp[[1]]
cov = temp[[2]]
dim(cov)
```

`## [1] 15 15`

`cov[1:6,1:6]`

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.39844768 -0.160568433 1.01207637 -0.328440364 0.02343384
## [2,] -0.16056843 0.592561777 -0.11385230 0.006699479 0.20282973
## [3,] 1.01207637 -0.113852305 1.53423943 -0.260049369 0.03713565
## [4,] -0.32844036 0.006699479 -0.26004937 2.639461891 0.60159975
## [5,] 0.02343384 0.202829732 0.03713565 0.601599751 0.76108170
## [6,] -0.29651208 0.087548214 -0.13532654 -0.150396050 -0.02450429
## [,6]
## [1,] -0.29651208
## [2,] 0.08754821
## [3,] -0.13532654
## [4,] -0.15039605
## [5,] -0.02450429
## [6,] 0.26982026
```

The (i,j) element in the covariance matrix represents the estimated covariance between observation i and observation j’s predictions.

The diagonals of the covariance matrix are exactly the variances we saw above:

`unique(diag(cov) - ustat[,2])`

`## [1] 0`

We can also calculate variance estimate on the train set (Case (1)):

```
ustat = forest.varU(fit$predictedAll,fit)
head(ustat)
```

```
## y.hat var.hat
## 1 21.90237 0.1105814
## 2 24.15650 0.2026405
## 3 25.85654 0.7319464
## 4 21.05798 5.2741773
## 5 23.69542 0.1236865
## 6 23.98517 0.1528721
```

`plot(ustat)`

Now we compare to the infinitesimal jackknife based variance using the randomForestCI package:

```
rf = randomForest(xtrain, ytrain, keep.inbag = TRUE, ntree=5000)
ij = randomForestInfJack(rf, xtrain, calibrate = TRUE)
head(ij)
```

```
## y.hat var.hat
## Oldsmobile Cutlass Ciera 4 21.64513 0.08579727
## Mazda 626 4 24.31732 0.15047229
## Honda Prelude Si 4WS 4 25.14416 0.93355101
## Dodge Grand Caravan V6 20.80192 2.07210200
## Buick Skylark 4 23.87048 0.28998848
## Ford Tempo 4 24.16110 0.22380466
```

`plot(ij)`

Next we try calling the infinitesimal jackknife code in the randomForestCI package, but using our forest where sampling is performed with replacement:

```
fit = forest(xtrain,ytrain,var.type="infjack",ntree=5000)
ij2_train_oob = fit$predicted # Case (1)
ij2 = forest.varIJ(fit$predictedAll,fit)
head(ij2)
```

```
## y.hat var.hat
## 1 21.81143 0.11325415
## 2 24.12591 0.08539112
## 3 25.84118 0.47283131
## 4 21.01991 5.87878848
## 5 23.58064 0.18573485
## 6 23.90902 0.16662527
```

`plot(ij2)`

Let’s compare our forest’s predictions to the predictions of the randomForest package:

```
rf_train_oob = rf$predicted
plot(ij2_train_oob,rf_train_oob)
lines(ij2_train_oob,ij2_train_oob,lty="dashed")
```

Let’s compare the variance estimates and see how they change when more trees are used in the forest:

```
varU = vector("numeric")
varIJ = vector("numeric")
nts = seq(1000,10000,2000)
for (nt in nts)
{
fit = forest(xtrain,ytrain,var.type="ustat",B=20,ntree=nt)
varU = c(varU,mean(forest.varU(fit$predictedAll,fit)[,2]))
rf = randomForest(xtrain, ytrain, keep.inbag = TRUE, ntree=nt)
varIJ = c(varIJ,mean(randomForestInfJack(rf, xtrain, calibrate = TRUE)[,2]))
}
```

```
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
```

```
plot(nts,varU,ylim=c(0,max(varU,varIJ)),cex.axis=0.6,ylab="Mean Est. Variance",xlab="Number of Trees",type="o",cex.lab=0.5)
points(nts,varIJ,col="blue",type="o")
legend("topright",legend=c("U-Stat","IJ"),col=c("black","blue"),lty="solid",cex=0.6)
```

`print(varU)`

`## [1] 2.4101866 0.9823698 0.7984688 0.6536290 0.6363574`

`print(varIJ)`

`## [1] 1.137313 1.096166 1.103347 1.107860 1.074623`

We can also perform sensitivity analysis on the B parameter, the number of common observations between trees.

```
varU = vector("numeric")
varIJ = vector("numeric")
bs = c(2,5,10,20,25)
for (b in bs)
{
fit = forest(xtrain,ytrain,var.type="ustat",B=b,ntree=1000)
varU = c(varU,mean(forest.varU(fit$predictedAll,fit)[,2]))
}
```

```
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
```

`plot(bs,varU,ylim=c(0,max(varU,varIJ)),cex.axis=0.6,ylab="Mean Est. Variance",xlab="B",type="o",cex.lab=0.5)`

`print(varU)`

`## [1] 0.2454759 0.8651213 1.1021971 2.3720810 2.9286694`

Next, we try classification. Currently only binary classification is supported:

```
#data(biopsy)
#x = biopsy[1:(ncol(biopsy)-1)]
#y = biopsy[,ncol(biopsy)]
x = kyphosis[,c("Age","Number","Start")]
y = kyphosis$Kyphosis
keep = !is.na(y)
y = y[keep]
x = x[keep,]
keep = !apply(is.na(x),1,any)
y = y[keep]
x = x[keep,]
n = length(y)
train = sample(1:n,n*0.7)
test = setdiff(1:n,train)
xtrain = x[train,]
ytrain = y[train]
xtest = x[test,]
ytest = y[test]
```

The response is rather imbalanced:

`table(y)`

```
## y
## absent present
## 64 17
```

Like for regression, we have several options for prediction.

`fit = forest(xtrain,ytrain,var.type="ustat",B=20,ntree=5000)`

```
## [1] "U-statistics based variance needs replace=FALSE. The forest was built with this option."
## [1] "U-statistics based variance needs individualTrees=TRUE. The forest was built with this option."
```

`names(fit)`

```
## [1] "inbag.times" "oob.times" "forest"
## [4] "type" "key" "varNames"
## [7] "replace" "var.type" "B"
## [10] "sampSize" "ntree" "individualTrees"
## [13] "predicted" "predictedAll" "predictedProb"
## [16] "predictedOOB" "predictedProbOOB"
```

```
u_train_oob = fit$predicted # Case (1)
table(u_train_oob)
```

```
## u_train_oob
## absent present
## 49 7
```

```
u_train = predict(fit,xtrain) # Case (2)
table(u_train)
```

```
## u_train
## absent present
## 47 9
```

```
temp = predict(fit,xtest,individualTrees=T) # Case (3)
u_test = temp$predicted
u_test_prob = temp$predictedProb
u_test_all = temp$predictedAll
table(u_test)
```

```
## u_test
## absent present
## 24 1
```

Check out the test set variance estimate (Case (2)):

```
ustat = forest.varU(u_test_all,fit)
head(ustat)
```

```
## y.hat var.hat
## 1 absent 0.0002777746
## 2 absent 0.0001122600
## 3 absent 0.0318183038
## 4 absent 0.0020098612
## 5 absent 0.0000000000
## 6 absent 0.0082595316
```

`plot(ustat)`

It’s possible to look at the predicted probability of the classes, instead of class names:

```
ustat = forest.varU(u_test_prob,fit)
head(ustat)
```

```
## y.hat var.hat
## 1 0.9902894 0.0002703610
## 2 0.9566679 0.0016197762
## 3 0.6873461 0.0289572081
## 4 0.9757894 0.0020932842
## 5 0.9873632 0.0002641357
## 6 0.8341363 0.0090549576
```

`plot(ustat)`

Now we look at the training set (Case (1)), both class names and probabilities:

```
ustat = forest.varU(fit$predictedAll,fit)
head(ustat)
```

```
## y.hat var.hat
## 1 absent 0.0173323135
## 2 absent 0.0023576963
## 3 absent 0.0091375044
## 4 present 0.0206843294
## 5 absent 0.0005899104
## 6 absent 0.0010125877
```

`plot(ustat)`

```
ustat = forest.varU(fit$predictedProb,fit)
head(ustat)
```

```
## y.hat var.hat
## 1 0.4651144 0.013532360
## 2 0.9375014 0.003645587
## 3 0.9085075 0.007554338
## 4 0.1901967 0.018635951
## 5 0.9810120 0.001036149
## 6 0.9476821 0.001819176
```

`plot(ustat)`

Again we can compare to the infinitesimal jackknife: (looks like there is a bug in the randomForestCI code (infinitesimalJackknife.R; line 144-145) in variance estimates for classification, so the code below is commented out until that package is fixed)

```
#rf = randomForest(x, y, keep.inbag = TRUE)
#ij = randomForestInfJack(rf, x, calibrate = TRUE)
#head(ij)
#plot(ij)
```

Again we run the infinitesimal jackknife code in the randomForestCI package on our forest with sampling with replacement:

```
fit = forest(xtrain,ytrain,var.type="infjack",ntree=5000)
ij2_train_oob = fit$predicted # Case (1)
ij2 = forest.varIJ(fit$predictedAll,fit)
head(ij2)
```

```
## y.hat var.hat
## 1 1.5176 0.0466423657
## 2 1.9776 0.0014053885
## 3 1.9184 0.0076155018
## 4 1.1922 0.0426017450
## 5 1.9940 0.0009817956
## 6 1.9808 0.0012334122
```

`plot(ij2)`

We end by comparing our forest’s predictions to that of the randomForest package

```
rf = randomForest(xtrain,ytrain,keep.forest=TRUE,keep.inbag=TRUE,replace=TRUE,ntree=5000)
rf_train_oob = rf$predicted
table(ij2_train_oob,rf_train_oob)
```

```
## rf_train_oob
## ij2_train_oob absent present
## absent 49 0
## present 2 5
```

Data with non-response, or classification with more than 2 categories is not yet supported. Contact us if you are eager for these!

Categorical predictors are currently converted to their numeric equivalents, not made into indicator variables. This feature is pending.

Like the randomForest package, the splitting criterion for regression is mean squared error, and gini impurity for binary classification.

- Minimizing gini impurity in binary classification is equivalent to minimizing mean squared error. For example, taking \(Y\in\{0,1\}\), it can be shown that MSE = \(\frac{1}{2}\) Gini impurity.

In a foundational paper, Breiman (2001) showed that a random forest’s sampling variance (by which we mean ….) is determined by two factors: the variance of each tree in the forest, and the correlation between trees. Intuitively, having trees that are certain about their predictions and less correlated with each other means the forest is reasonably certain .

One of the parameters driving correlation between individual trees is how many features they share. Hence, the higher the proportion of features sampled by each tree, the higher the probability that trees have features in common. However, this same parameter also affects the variance of an individual tree - the less features a tree is allowed to use, the more uncertain its prediction. Hence, ….