BRAD: Simplifying Cloud Data Processing with Learned Automated Data Meshes
The last decade of database research has led to the prevalence of specialized systems for different workloads. Consequently, organizations often rely on a combination of specialized systems, organized in a Data Mesh. Data meshes present significant challenges for system administrators, including picking the right system for each workload, moving data between systems, maintaining consistency, and correctly configuring each system. Many non-expert end users (e.g., data analysts or app developers) either cannot solve their business problems, or suffer from sub-optimal performance or cost due to this complexity. We envision BRAD, a cloud system that automatically integrates and manages data and systems into an instance-optimized data mesh, allowing users to efficiently store and query data under a unified data model (i.e., relational tables) without knowledge of underlying system details. With machine learning, BRAD automatically deduces the strengths and weaknesses of each engine through a combination of offline training and online probing. Then, BRAD uses these insights to route queries to the most suitable (combination of) system(s) for efficient execution. Furthermore, BRAD automates configuration tuning, resource scaling, and data migration across component systems, and makes recommendations for more impactful decisions, such as adding or removing systems. As such, BRAD exemplifies a new class of systems that utilize machine learning and the cloud to make complex data processing more accessible to end users, raising numerous new problems in database systems, machine learning, and the cloud.
FactorJoin Cardinality Estimation
Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Neither classical nor learning-based methods yield satisfactory performance when estimating the cardinality of the join queries. They either rely on simplified assumptions leading to ineffective cardinality estimates or build large models to understand the data distributions, leading to long planning times and a lack of generalizability across queries. We propose a new framework FactorJoin for estimating join queries. FactorJoin combines the idea behind the classical join-histogram method to efficiently handle joins with the learning-based methods to accurately capture attribute correlation. Specifically, FactorJoin scans every table in a DB and builds single-table conditional distributions during an offline preparation phase. When a join query comes, FactorJoin translates it into a factor graph model over the learned distributions to effectively and efficiently estimate its cardinality. Unlike existing learning-based methods, FactorJoin does not need to de-normalize joins upfront or require executed query workloads to train the model. Since it only relies on single-table statistics, FactorJoin has small space overhead and is extremely easy to train and maintain. In our evaluation, FactorJoin can produce more effective estimates than the previous state-of-the-art learning-based methods, with 40x less estimation latency, 100x smaller model size, and 100x faster training speed at comparable or better accuracy. In addition, FactorJoin can estimate 10,000 sub-plan queries within one second to optimize the query plan, which is very close to the traditional cardinality estimators in commercial DBMS.
Data preparation has been seen as "janitor work" yet essential in data-to-insight pipelines. The increasing liberality of data is followed by an explosion in the diversity of data consumers. However, the required technical and domain expertise prevents many from performing extensive data preparation. Further, many seem to be stuck in a vicious cycle of writing one-off programs to process data. Recently, automating data preparation programs has been shown to improve many aspects of the pipeline, including data quality, research reproducibility, and user productivity. We propose a novel approach to automatically improve data preparation programs.
Self-Organizing Data Containers
We propose a new self-organizing, self-optimizing, meta-data rich storage layer for the cloud, called a self-organizing data container (SDC), that enables order-of-magnitude performance improvements in data-intensive applications through instance-optimization, i.e., the adaptation of data representation to exploit both the distribution of the data and the workload operating on it. Unlike existing cloud storage systems like Delta Lake, Apache Iceberg, and Apache Hudi, SDCs capture both data and metadata, like access histories and distributional statistics, and are designed to be flexible enough to encompass a variety of modern high-performance representations for data analytics, including partitioning, replication, indexing, and materialization.
Serverless State Management Systems
Modern cloud developers face many distributed systems complexities when building disaggregated applications from cloud building blocks. We propose a new class of cloud services, called Serverless State Management Systems (SSMS), that abstracts away these complexities and transparently manages fault-tolerance, deployment, and scaling of a logical cloud application on physical cloud resources. An SSMS, analogous to a DBMS, provides three important abstractions for disaggregated applications: 1) a logical application model, similar to relational algebra, that describes application semantics but abstracts away the deployment details, 2) strong resilient programming primitives, similar to ACID transactions, that simplifies fault-tolerant programming in the cloud, and 3) smart, cost-based optimization schemes that automates scheduling, placement, and other details, similar to a query optimizer. SSMS is an overarching research direction that encapsulates several projects in cloud, distributed and concurrent systems.
Treeline: An Update-In-Place Key-Value Store for Modern Storage
Many modern key-value stores, such as RocksDB, rely on log-structured merge trees (LSMs). Originally designed for spinning disks, LSMs optimize for write performance by only making sequential writes. But this optimization comes at the cost of reads: LSMs must rely on expensive compaction jobs and Bloom filters—all to maintain reasonable read performance. For NVMe SSDs, we argue that trading off read performance for write performance is no longer always needed. With enough parallelism, NVMe SSDs have comparable random and sequential access performance. This change makes update-in-place designs, which traditionally provide excellent read performance, a viable alternative to LSMs. In our paper, we close the gap between log-structured and update-in-place designs on modern SSDs with the help of new components that take advantage of data and workload patterns. Specifically, we explore three key ideas: (A) record caching for efficient point operations, (B) page grouping for high-performance range scans, and (C) insert forecasting to reduce the reorganization costs of accommodating new records. We evaluate these ideas by implementing them in a prototype update-in-place key-value store called TreeLine. On YCSB, we find that TreeLine outperforms RocksDB and LeanStore by 2.20× and 2.07× respectively on average across the point workloads, and by up to 10.95× and 7.52× overall.