Hui Fen (Sarah) Tan
PhD student, Cornell Statistics

Contact Me
h t 3 9 5 AT cornell DOT edu
Github   LinkedIn

I'm a PhD student at Cornell Statistics, minoring in Computer Science. I'm advised by Giles Hooker and Martin Wells, and Thorsten Joachims is on my committee. Previously, I studied at Berkeley and Columbia, and worked at startups and nonprofits in NYC, as well as the city's health department and public hospital system. In 2014, I was a Data Science for Social Good fellow. I spent summer 2015 at Xerox Research (now Naver Labs) and summer 2017 at Microsoft Research Redmond, working with Rich Caruana. My work is supported by a Harmony Institute Research Fellowship and an Engaged Cornell grant.

Research Interests

Broadly, I work on inference for and interpretability of machine learning methods, in particular tree ensembles, as well as on causal inference. I particularly enjoy working on methods useful for healthcare and public policy.


Code & Data

R package surfin (Statistical Inference for Random Forests)

Black-box risk scores data sets: coming soon

Publications & Presentations

Machine Learning

Tan, R Caruana, G Hooker, Y Lou. Detecting Bias in Black-Box Models Using Transparent Model Distillation. Under review

Tan, G Hooker, M Wells. Peeking into the Random Forest: Interpretability in Tree and Observation Space. In submission. Preliminary version: Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable. NIPS Interpretability Workshop 2016. [blog mention]

Tan, S Makela, D Heller, K Konty, S Balter, T Zheng, J Stark. Using Bayesian Evidence Synthesis to Estimate Disease Prevalence Among Hard-To-Reach Populations: Hepatitis C in New York City. Under review; presented to NYC Health Commissioner

Tan, G Hooker, M Wells. Probabilistic Matching: Incorporating Uncertainty to Correct for Selection Bias. NIPS Causal Inference Workshop 2016; journal version in submission

Tan, D Miller, J Savage. Proximity Score Matching: Using Random Forest Distance for Matching in Causal Inference. Student Paper Award, American Statistical Association SSPA section; NIPS Machine Learning in Healthcare Workshop 2015

Natural Language Processing Applications

I Vasi, E Walker, JS Johnson, Tan. "No Fracking Way!" Documentary Film, Discursive Opportunity, and Local Opposition against Hydraulic Fracturing in the United States, 2010 to 2013. American Sociological Review 2015. 2 Best Paper Awards, American Sociological Association CITAMS and CBSM sections. [press release] [press release] [The Guardian] [The Atlantic] [Pacific Standard]

Older Work

Machine Learning

(Alphabetical) FB Darku, S He, MA Hossain, S Ren, Tan, I Trejo-Lorenzo. Positive Unlabeled Learning for Anomaly Detection in Nut Allergies Protein Microarray Data. IMSM 2016 project; CRSC Technical Report TR16-08. [press release] [blog post]

Tan, R Rotabi, HGT Nguyen. Using Ranking Support Vector Machines for Group Recommendations. NYAS Machine Learning Symposium 2015

(Alphabetical) S Abraham, J Lockhart, Tan, R Turner, Y Kim. Identifying At-Risk Mothers for Targeted Interventions. KDD 2014 Session on Data Science for Social Good. [blog post] [presentation]

Statistical Methods for Healthcare

Tan, R Low, S Ito, R Gregory, L Bielory, V Dunn. Two Ways of Modeling Hospital Readmissions: Mixed and Marginal Models. Proceedings of JSM 2013

Tan, R Low, S Ito, R Gregory, V Dunn. Drug Interactions of Beta Blockers and Beta Agonists and Their Association with Hospital Admissions. Proceedings of SAS GF 2013

R Low, S Ito, R Gregory, L Rassi, Tan, C Jacobs. Hospital Readmission Rates: Related To ED Volume, Population, And Economic Variables. Academic Emergency Medicine 2012


Fun Stuff