research
Below is my (mostly) complete bibliography with links to articles, presentations, posters, videos, and suggested citations. Make sure to check my blog as well for additional informal discussion of some of this research.
Projects
Collaborative data science development
While the open-source model for software development has led to successful, large-scale collaborations in building software applications, chess engines, and scientific analyses, data science has not benefited from this development paradigm. In part, this is due to the divide between the development processes used by software engineers and those used by data scientists.
Ballet tries to address this disparity. It is a lightweight software framework that supports collaborative data science development by composing a data science pipeline from a collection of modular patches that can be written in parallel. Ballet provides the underlying functionality to support interactive development, test and merge high-quality contributions, and compose the accepted contributions into a single product.
I've led development of the core Ballet framework, the Assemblé development environment, the Ballet Bot, and a bunch of other software.
We evaluated Ballet in an extensive case study of a personal income prediction project. Our preprint describes our ideas for collaborative data science development, the design of the framework, and the results of this evaluation.
Frameworks for AutoML
While developing and deploying ML systems in my research group, we realized that every project used a different set of libraries, chosen for the task at hand, that fit together more or less poorly. To address this, we redesigned our system-building approach around the concepts of ML primitives, ML pipelines, and AutoML components. The resulting software framework is used for everything from our entry to DARPA's Data-Driven Discovery of Models program to unsupervised time-series anomaly detection in satellite telemetry to ML on electronic health records. I designed the BTB library for model selection and hyperparameter tuning, to which many folks in the Data to AI Lab have also contributed. We describe the framework, some of the ML and AutoML systems we have built with it, and a thorough evaluation in this paper.
Systems for AutoML
I am a developer on the ATM project, a full-fledged open-source system for joint model selection and hyperparameter tuning for classification. ATM is one of the first projects from the research community to go beyond libraries for model selection or hyperparameter tuning and provide a system with a database backend designed for ease of use and high performance. On top of this, we collaborated with the VisLab at HKUST to create a frontend for ATM that allows users to monitor and control an ongoing AutoML search process. This led to the ATMSeer system, which we describe in this paper.
Publications
"AutoML to Date and Beyond: Challenges and Opportunities." ACM Computing Surveys. 2021. (Also published at arXiv:2010.10777 [cs])
"Enabling Collaborative Data Science Development with the Ballet Framework." Proceedings of the ACM on Human-Computer Interaction. 2021. (Also published at arXiv:2012.07816 [cs])
"Collaborative, Open, and Automated Data Science." Thesis. 2021.
"Meeting in the Notebook: A Notebook-Based Environment for Micro-Submissions in Data Science Collaborations." arXiv:2103.15787 [cs]. 2021.
"The Machine Learning Bazaar: Harnessing the ML Ecosystem for Effective System Development." Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2020. (Also published at arXiv:1905.08942 [cs])
"Understanding User-Bot Interactions for Small-Scale Automation in Open-Source Development." Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems. 2020.
"Demonstration of Ballet: A Framework for Open-Source Collaborative Feature Engineering." Proceedings of the 3rd MLSys Conference. 2020.
"ATMSeer: Increasing Transparency and Controllability in Automated Machine Learning." Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 2019. (Also published at arXiv:1902.05009 [cs])
"Ballet: A Lightweight Framework for Open-Source, Collaborative Feature Engineering." Workshop on Systems for Machine Learning and Open Source Software at NeurIPS 2018. 2018.
"Scaling Collaborative Open Data Science." Thesis. 2018.
"FeatureHub: Towards Collaborative Data Science." 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA). 2017.
"Query Optimization for Dynamic Imputation." Proceedings of the VLDB Endowment. 2017.
"The Great Escape? A Quantitative Evaluation of the Fed's Liquidity Facilities." American Economic Review. 2017. (Substantial contribution)
"The Macro Effects of the Recent Swing in Financial Conditions." Report. 2016.
"The FRBNY DSGE Model Meets Julia." Article. 2015.
"The DSGE MATLAB to Julia Transition: Improvements and Challenges." Article. 2015.
"The FRBNY DSGE Model Forecast - November 2015." Article. 2015.
"Just Released: 2015 SCE Housing Survey Shows Households Optimistic about Housing Market." Article. 2015.
"Survey of Consumer Expectations: Housing Survey - 2015: Report." Article. 2015.
"Why are interest rates so low?" Article. 2015.
"The FRBNY DSGE Model Forecast - April 2015." Article. 2015.
"The forward guidance puzzle." Report. 2012. (Substantial contribution)