ML for Systems
Our vision for research on ML for Systems is laid out in SageDB, a new type of data processing system that specializes itself to a particular application through code synthesis and machine learning. This vision is also a focus of MIT DSAIL.
Below is an overview of the data systems components we are currently working on, with links to more detailed project descriptions, followed by a list of our open-source repositories:
Storage layout and index structures are the most important factors in guaranteeing efficient data access, and both are amenable to enhancement by learned data and workload models. The original work on learned indexes showed that replacing traditional index structures with learned models can improve data access times while reducing memory footprint. Since then, we have worked to make learned indexes more practical, developed learned multi-dimensional indexes, created a benchmark for learned index structures, and applied learned indexes to DNA sequence search.
- The Case for Learned Index Structures. SIGMOD 2018. Tim Kraska, Alex Beutel, Ed Chi, Jeffrey Dean, Neoklis Polyzotis
- SOSD: A Benchmark for Learned Indexes. NeurIPS Workshop on ML for Systems 2019. Andreas Kipf*, Ryan Marcus*, Alexander van Renen*, Mihail Stoian, Alfons Kemper, Tim Kraska, Thomas Neumann
- LISA: Towards Learned DNA Sequence Search. NeurIPS Workshop on Systems for ML 2019. Darryl Ho, Jialin Ding, Sanchit Misra, Nesime Tatbul, Vikram Nathan, Vasimuddin Md, Tim Kraska
- Learning Multi-dimensional Indexes. SIGMOD 2020. Vikram Nathan*, Jialin Ding*, Mohammad Alizadeh, Tim Kraska
- CDFShop: Exploring and Optimizing Learned Index Structures. SIGMOD 2020 (demo). Ryan Marcus, Emily Zhang, Tim Kraska
- ALEX: An Updatable Adaptive Learned Index. Preprint. Jialin Ding, Umar Farooq Minhas, Hantian Zhang, Yinan Li, Chi Wang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet
- Learning Scheduling Algorithms for Data Processing Clusters. SIGCOMM 2019. Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, Mohammad Alizadeh
- The Case for a Learned Sorting Algorithm. SOSP Workshop on AI Systems 2019. Ani Kristo*, Kapil Vaidya*, Ugur Cetintemel, Tim Kraska
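To illustrate the core idea behind learned indexes, here is a minimal Python sketch: a single linear model approximates the CDF of the sorted keys to predict each key's position, and a search window bounded by the model's worst-case error corrects any misprediction. This is a deliberately simplified, hypothetical example (a real RMI uses a hierarchy of models), not the implementation from the paper.

```python
import bisect


class LearnedIndex:
    """Sketch of a learned index: a linear model predicts a key's
    position in a sorted array; a bounded binary search fixes errors."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit position ~ slope * key + intercept by least squares
        # over (key, rank) pairs -- i.e., approximate the keys' CDF.
        xs, ys = self.keys, range(n)
        mean_x = sum(xs) / n
        mean_y = (n - 1) / 2
        var = sum((x - mean_x) ** 2 for x in xs) or 1.0
        self.slope = sum((x - mean_x) * (y - mean_y)
                         for x, y in zip(xs, ys)) / var
        self.intercept = mean_y - self.slope * mean_x
        # Worst-case prediction error bounds the correction search.
        self.max_err = max(abs(self._predict(x) - y)
                           for x, y in zip(xs, ys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of `key` in the sorted array, or -1."""
        pos = self._predict(key)
        lo = max(0, pos - self.max_err)
        hi = min(len(self.keys), pos + self.max_err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return -1
```

The memory savings come from storing only the model parameters and error bound instead of a tree of internal nodes; lookup cost depends on how tightly the model fits the key distribution.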
Traditional query optimizers are extremely hard to build and maintain, and they often yield sub-optimal query plans. This brittleness and complexity makes the optimizer a good candidate for learning. We have introduced the first end-to-end learned query optimizer, and we are currently exploring how to generalize it to unseen data and schemas. We are also exploring learning-based cardinality estimation techniques.
- Neo: A Learned Query Optimizer. VLDB 2019. Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, Nesime Tatbul
- Cost-Guided Cardinality Estimation: Focus Where it Matters. SMDB 2020. Parimarjan Negi, Ryan Marcus, Hongzi Mao, Nesime Tatbul, Tim Kraska, Mohammad Alizadeh
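To give a flavor of learning-based cardinality estimation, here is a toy Python sketch (hypothetical; not the architecture of Neo or our cardinality work): a linear model over per-predicate log-selectivity features is trained by gradient descent to predict log cardinality, so it can learn correlations between predicates that the classic independence assumption misses. The names `featurize`, `train`, and `estimate` are made up for this example.

```python
import math


def featurize(selectivities):
    # One feature per predicate (its log selectivity) plus a bias term.
    return [1.0] + [math.log(s) for s in selectivities]


def train(samples, table_size, epochs=500, lr=0.05):
    """Fit weights so that dot(w, features) ~ log(card / table_size).

    samples: list of (per-predicate selectivities, true cardinality).
    """
    dim = len(featurize(samples[0][0]))
    w = [0.0] * dim
    for _ in range(epochs):
        for sels, card in samples:
            x = featurize(sels)
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - (math.log(card) - math.log(table_size))
            for i in range(dim):
                w[i] -= lr * err * x[i]  # SGD step on squared error
    return w


def estimate(w, selectivities, table_size):
    x = featurize(selectivities)
    return table_size * math.exp(sum(wi * xi for wi, xi in zip(w, x)))
```

On a workload where two predicates are perfectly correlated, the independence assumption multiplies their selectivities and badly underestimates, while the trained model recovers cardinalities close to the truth.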
We have open-sourced a number of our projects, including SOSD, a benchmark for learned index structures; a reference implementation of the recursive model index (RMI) described in the original learned index paper; and Park, a platform for researchers to experiment with Reinforcement Learning (RL) for various systems problems.
Open-source repositories (with associated publication):
- Search on Sorted Data Benchmark (SOSD). https://github.com/learnedsystems/sosd. (Link to NeurIPS 2019 workshop paper)
- Reference implementation of recursive model index (RMI). https://github.com/learnedsystems/RMI. (Link to NeurIPS 2019 workshop paper)
- Park: An Open Platform for Learning Augmented Computer Systems. https://github.com/park-project/park. (Link to NeurIPS 2019 paper)