Learned Multi-Dimensional Index for Date Warehouses

Given a multi-dimensional table in a data warehouse, an analyst will often want to run queries on subsets of those dimensions. For example, given a table that tracks employee attributes such as age, salary, level, and start date, one possible range query is, “Return all employees between ages 25-30 with salary between $90K and $100K."

Current methods for indexing multi-dimensional data are often unsatisfactory when there are many dimensions that are used in small combinations in queries. In practice, an index is built on the most important dimension, secondary indexes are built on some other dimensions, and infrequently-queried dimensions are simply not indexed. However, the decision of which dimensions to index is ad hoc and based on hand-tuned parameters.

In this project, we are creating a multi-dimensional learned index that uses knowledge of the dataset and the query workload to automatically create an optimal index. Specifically, the learned index optimally rearranges the data storage order to achieve low scan overhead. On a workload derived from the TPC-H benchmark, the multi-dimensional learned index achieves up to 34X faster queries than a clustered column index and 1000X faster than an index-less column scan, while also maintaining an index size many orders of magnitude smaller than the size of the dataset.