ML for Systems

Our vision for ML for Systems research is laid out in SageDB, a new type of data processing system that specializes heavily to a particular application through code synthesis and machine learning. This vision is also a focus of MIT DSAIL.

We provide an overview of data systems components that we are currently working on, with more detailed project descriptions in the links, as well as a list of open-source repositories:

Data Access

Storage layout and index structures are the most important factors in efficient data access, and both can be enhanced by learned models of the data and the workload. The original work on learned indexes showed that replacing traditional index structures with learned models can improve data access times while reducing memory footprint. Since then, we have worked to make learned indexes more practical, developed learned multi-dimensional indexes, created a benchmark for learned index structures, and applied learned indexes to DNA sequence search.
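To make the idea of a learned index concrete, here is a minimal sketch in Python. It is not the RMI from the original paper: it assumes a single linear model fitted by least squares over a non-empty, sorted list of numeric keys, and the class name `LinearLearnedIndex` and its methods are hypothetical names chosen for illustration. The model predicts a key's position, and the recorded worst-case prediction error bounds the range that a final binary search has to cover.

```python
import bisect


class LinearLearnedIndex:
    """Toy single-model learned index (illustrative only, not the RMI)."""

    def __init__(self, keys):
        # Assumption: keys is a non-empty collection of numbers.
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit position ~ slope * key + intercept by ordinary least squares.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case error over the stored keys bounds the lookup window.
        self.max_err = max(
            abs(i - self._predict(k)) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        # Predicted position, clamped to the valid index range.
        pos = round(self.slope * key + self.intercept)
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the first position of key, or None if it is absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        # Binary search only inside the error-bounded window.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


index = LinearLearnedIndex([3, 8, 15, 15, 22, 40, 41, 97])
position = index.lookup(22)
```

The only state kept besides the keys is two floats and one integer error bound, which hints at where the memory savings come from. How small the search window is depends entirely on how well the model fits the key distribution.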


Query Execution

Learning the data distribution can also be used to speed up query execution. In particular, we are applying learned techniques to sorting, scheduling, and joins.
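As a rough illustration of how a learned data distribution can speed up sorting, the sketch below uses an approximate CDF to place each element into a bucket and then sorts only within buckets. This is a toy under stated assumptions, not the algorithm from our publications: the "model" is simply an empirical CDF built from a random sample, and the function name `learned_sort` and its parameters `num_buckets`, `sample_size`, and `seed` are hypothetical.

```python
import bisect
import random


def learned_sort(values, num_buckets=16, sample_size=64, seed=0):
    """Sort by bucketing elements with an approximate CDF (toy sketch)."""
    values = list(values)
    if len(values) < 2:
        return values
    # "Model": an empirical CDF estimated from a small sorted sample.
    rng = random.Random(seed)
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # Rank in the sample approximates CDF(v); integer arithmetic keeps
        # the bucket id in [0, num_buckets - 1] and monotone in v.
        rank = bisect.bisect_right(sample, v)
        buckets[rank * num_buckets // (len(sample) + 1)].append(v)
    # Monotone bucketing means buckets are already in order relative to
    # each other, so only each bucket's contents need sorting.
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))
    return out


data = [random.Random(1).gauss(0, 1) for _ in range(10)]
result = learned_sort(data)
```

Because the bucket assignment is monotone in the value, correctness does not depend on the quality of the CDF estimate. A poor estimate only makes the buckets uneven, which shifts more work into the per-bucket sort.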


Query Optimization

Traditional query optimizers are extremely hard to build and maintain, and they often yield sub-optimal query plans. The brittleness and complexity of the optimizer make it a good candidate to be learned. We have introduced the first end-to-end learned query optimizer, and we are currently exploring how to generalize it to unseen data and schemas. We are also exploring learning-based cardinality estimation techniques.


Open Source

We have open-sourced a number of our projects, including SOSD, a benchmark for learned index structures; a reference implementation of the recursive model index (RMI) described in the original learned index paper; and Park, a platform for researchers to experiment with Reinforcement Learning (RL) for various systems problems.

Open-source repositories (with associated publications):