Surfin

Statistical Inference for Random Forests


Description

This R package estimates the uncertainty of random forest predictions, using a fast C++ implementation of random forests. This is an exciting time for research into the theoretical properties of random forests. The package aims to provide all state-of-the-art variance estimates in one place, to expedite research in this area and make it easier for practitioners to compare estimates.

Two variance estimates are currently provided: a U-statistics-based estimate (Mentch & Hooker, 2016) and the infinitesimal jackknife on bootstrap samples (Wager, Hastie & Efron, 2014). The latter is a wrapper around the authors' R code, randomForestCI.
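To make the infinitesimal jackknife idea concrete, here is a minimal base-R sketch of the estimate for a generic bagged predictor: the variance is estimated as the sum, over training observations, of the squared covariance between how often an observation appears in a bootstrap sample and that sample's prediction. This is not surfin's or randomForestCI's implementation. The toy data, the cubic lm() base learner (a stand-in for trees), and all variable names are assumptions made for illustration, and the Monte Carlo bias correction discussed by Wager, Hastie & Efron is left out.

```r
# Sketch: infinitesimal jackknife (IJ) variance of a bagged prediction, base R only.
# Assumption: toy data and an lm() base learner stand in for a real forest.
set.seed(42)

n <- 100
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)
x0 <- data.frame(x = 0.5)                  # query point

B <- 1000                                  # number of bootstrap replicates
counts <- matrix(0L, nrow = B, ncol = n)   # counts[b, i]: times obs i appears in sample b
preds <- numeric(B)                        # preds[b]: prediction of replicate b at x0

for (b in seq_len(B)) {
  idx <- sample.int(n, n, replace = TRUE)  # bootstrap sample of row indices
  counts[b, ] <- tabulate(idx, nbins = n)
  fit <- lm(y ~ x + I(x^2) + I(x^3), data = dat[idx, ])
  preds[b] <- predict(fit, newdata = x0)
}

bagged_pred <- mean(preds)

# Covariance across replicates between each observation's count and the prediction.
# cov() divides by B - 1 rather than B; the difference is negligible for large B.
cov_i <- drop(cov(counts, preds))

# Raw IJ variance estimate: sum of squared covariances (no bias correction).
var_ij <- sum(cov_i^2)

cat("bagged prediction:", bagged_pred, "\n")
cat("IJ standard error:", sqrt(var_ij), "\n")
```

With a finite number of replicates the raw estimate tends to be inflated by Monte Carlo noise, so in practice B should be large or a bias-corrected version used.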

Check out a demo: How Uncertain Are Your Random Forest Predictions?

Updates

More variance estimates coming soon: (1) bootstrap of little bags (Sexton & Laake, 2009), and (2) infinitesimal jackknife on subsamples (Wager & Athey, 2018; Athey, Tibshirani & Wager, 2019), the latter as a wrapper to the authors' R package grf.

This package is under active development. Feedback, bug reports, and pointers to other variance estimates are very much welcome! Email me.

Installation

Download the .tar.gz (whether you are on Mac, Linux, or Windows). Note that it is a source package, not a binary, and must be compiled using C++ development tools. If you don't already have the dependencies (Rcpp, RcppArmadillo, Matrix, knitr) and optional dependencies (randomForest, rpart) installed, install those from CRAN first. If you are on Windows, make sure you have Rtools installed to provide the C++ development tools. If you already have an older version of surfin installed, remove it first by typing the following in R:

$ remove.packages("surfin")

Then, within base R (not RStudio), install using:

$ install.packages(path_to_downloaded_file, repos=NULL, type="source")
$ library(surfin)

While surfin installation is currently incompatible with RStudio, once it is installed (using base R), it can be run from RStudio. Please email me if you encounter any installation issues.
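The dependency step described above can be sketched as follows. This is a hedged convenience snippet rather than an official install script: the package names are the ones listed in this section, while the helper missing_pkgs() is hypothetical and introduced only for illustration. It uses only base R's installed.packages() and install.packages().

```r
# Dependencies named in the installation notes.
required <- c("Rcpp", "RcppArmadillo", "Matrix", "knitr")
optional <- c("randomForest", "rpart")

# Hypothetical helper: which of the wanted packages are not yet installed?
missing_pkgs <- function(wanted, installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

to_install <- missing_pkgs(c(required, optional))

# Only install from an interactive session, so that sourcing this file
# from a script has no side effects.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)   # fetched from CRAN
}
```

After this, the source install of surfin itself proceeds as shown above.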

Once installed, you can see surfin's help file by typing:

$ ?surfin

in R, or check out the demo.

References

Mentch L & Hooker G. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research. 2016.

Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. Journal of Machine Learning Research. 2014.

Sexton J & Laake P. Standard errors for bagged and random forest estimators. Computational Statistics & Data Analysis. 2009.

Wager S & Athey S. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Journal of the American Statistical Association. 2018.

Athey S, Tibshirani J, Wager S. Generalized Random Forests. The Annals of Statistics. 2019.

Authors and Contributors

Sarah Tan @shftan, David Miller @d-miller, Giles Hooker @gileshooker, Lucas Mentch @LMentch

Maintainer

Sarah Tan. Email me.