Hui Fen (Sarah) Tan
PhD student, Cornell Statistics

Contact Me
h t 3 9 5 AT cornell DOT edu
GitHub   LinkedIn

I'm a PhD student at Cornell Statistics, minoring in Computer Science. I'm advised by Giles Hooker and Martin Wells, and Thorsten Joachims is on my committee. Previously, I studied at Berkeley and Columbia, and worked at startups and nonprofits in NYC, including the health department and public hospital system. In 2014, I was a Data Science for Social Good fellow. I spent summer 2015 at Xerox Research (now Naver Labs) and summer 2017 at Microsoft Research Redmond, working with Rich Caruana.

Research Interests

Broadly, I work on inference for and interpretability of machine learning methods, particularly tree ensembles, as well as on causal inference.


Code & Data

Publications & Presentations

Machine Learning

Tan, R Caruana, G Hooker, Y Lou. Detecting Bias in Black-Box Models Using Transparent Model Distillation. Under review. Spotlight at NIPS Interpretability Symposium 2017 [MIT Technology Review]

Tan, S Makela, D Heller, K Konty, S Balter, T Zheng, J Stark. Using Bayesian Evidence Synthesis to Estimate Disease Prevalence Among Hard-To-Reach Populations: Hepatitis C in New York City. Under review. Presented to NYC Health Commissioner. Talk at National Development and Research Institutes

Tan, G Hooker, M Wells. Probabilistic Matching: Incorporating Uncertainty to Correct for Selection Bias. In submission. Preliminary version in NIPS Causal Inference Workshop 2016

Tan, G Hooker, M Wells. Peeking into the Random Forest: Interpretability in Tree and Observation Space. In submission. Preliminary version: Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable. NIPS Interpretability Workshop 2016. [blog mention]

Tan, D Miller, J Savage. Proximity Score Matching: Using Random Forest Distance for Matching in Causal Inference. Student Paper Award, American Statistical Association SSPA section. Lightning talk, Atlantic Causal Inference Conference 2015. NIPS Machine Learning in Healthcare Workshop 2015

Natural Language Processing

(Equal contribution) S Seto*, Tan*, G Hooker, M Wells. A Double Parametric Bootstrap Test for Topic Models. NIPS Interpretability Symposium 2017

I Vasi, E Walker, JS Johnson, Tan. "No Fracking Way!" Documentary Film, Discursive Opportunity, and Local Opposition against Hydraulic Fracturing in the United States, 2010 to 2013. American Sociological Review 2015. 2 Best Paper Awards, American Sociological Association CITAMS and CBSM sections. [press release] [press release] [The Guardian] [The Atlantic] [Pacific Standard]

Older Work

Machine Learning

(Alphabetical) FB Darku, S He, MA Hossain, S Ren, Tan, I Trejo-Lorenzo. Positive Unlabeled Learning for Anomaly Detection in Nut Allergies Protein Microarray Data. IMSM 2016 project; CRSC Technical Report TR16-08. Talk at Statistical and Applied Mathematical Sciences Institute. [press release] [blog post]

Tan, R Rotabi, HGT Nguyen. Using Ranking Support Vector Machines for Group Recommendations. NYAS Machine Learning Symposium 2015

(Alphabetical) S Abraham, J Lockhart, Tan, R Turner, Y Kim. Identifying At-Risk Mothers for Targeted Interventions. KDD 2014 Session on Data Science for Social Good. Talk at Chicago Python User Group. [blog post] [presentation]

Statistical Methods for Healthcare

Tan, R Low, S Ito, R Gregory, L Bielory, V Dunn. Two Ways of Modeling Hospital Readmissions: Mixed and Marginal Models. Proceedings of JSM 2013

Tan, R Low, S Ito, R Gregory, V Dunn. Drug Interactions of Beta Blockers and Beta Agonists and Their Association with Hospital Admissions. Proceedings of SAS GF 2013

R Low, S Ito, R Gregory, L Rassi, Tan, C Jacobs. Hospital Readmission Rates: Related To ED Volume, Population, And Economic Variables. Academic Emergency Medicine 2012


Awards & Grants

Fun Stuff