Privileged Kernel Bypass: A Vision for Virtualized DB‑OS Co‑Design
Rethinking the Database–Operating System Boundary
Modern database systems cannot fully exploit modern hardware because they run on general-purpose operating systems whose dated interfaces and slow implementations are mismatched with data-intensive workloads. For example, a general-purpose OS can manage a database's buffer pages through its own page cache and memory-mapped I/O. This mmap-based buffer pool eliminates the software hash tables used in traditional DBMS buffer managers, but it also introduces a severe bottleneck: whenever the OS evicts a cached page, it must invalidate address-translation entries on every CPU core. These global TLB shootdowns are expensive and frequent, preventing the database from saturating the near-DRAM bandwidth and tens of millions of IOPS that modern SSDs can provide. Similarly, the virtual memory snapshotting service the OS provides (fork) is too slow for high-performance databases. In other words, the OS's virtual memory subsystem becomes the limiting factor.
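To make the mismatch concrete, here is a minimal sketch of an mmap-based buffer pool using the standard Linux mmap/madvise interfaces (db.data and the sizes are placeholders): "pinning" a page is just a pointer dereference, but evicting one forces the kernel to unmap it and broadcast TLB shootdowns.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096UL

    int main(void) {
        /* Map the database file directly: the OS page cache becomes the
         * buffer pool, so no software hash table maps page id -> frame. */
        int fd = open("db.data", O_RDWR);   /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }
        size_t pool_size = 1UL << 30;       /* 1 GiB, for illustration */
        char *pool = mmap(NULL, pool_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (pool == MAP_FAILED) { perror("mmap"); return 1; }

        /* "Pinning" page 42 is pointer arithmetic plus a page fault. */
        char *page = pool + 42 * PAGE_SIZE;
        uint64_t header = *(const uint64_t *)page;
        printf("page header: %llx\n", (unsigned long long)header);

        /* Evicting drops the mapping. The kernel must now invalidate the
         * stale translation on every core that may have cached it: a
         * global TLB shootdown delivered via inter-processor interrupts. */
        madvise(page, PAGE_SIZE, MADV_DONTNEED);

        munmap(pool, pool_size);
        close(fd);
        return 0;
    }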
Two approaches have been proposed to address such DB-OS interface mismatches. One is to modify the host kernel or run the database on a unikernel, a minimal OS that merges application and kernel into a single address space. Unikernels offer direct hardware access, but they require reimplementing device drivers and POSIX compatibility, and they lack the rich tooling built around Linux. The other approach specializes Linux with new system calls or eBPF modules, but these techniques raise security and maintainability problems.
Our Approach: Running the DB as a Privileged Process
We propose privileged kernel bypass, a virtualization‑based co‑design in which the database runs in the kernel space of a lightweight guest OS on top of a hypervisor. The hypervisor (e.g., Dune) exports hardware‑assisted virtualization to a Linux process; system calls are proxied back to the host kernel, so the database preserves the POSIX process abstraction and remains compatible with the Linux ecosystem. Inside the guest kernel, the DBMS obtains direct access to privileged instructions—page‑table management, TLB invalidation, interrupt control and I/O virtualization—without crossing the system‑call boundary. Crucially, all modifications are confined to the guest kernel, leaving the host kernel unmodified. This approach expands the co‑design space without abandoning Linux or rewriting the entire OS.
Building New DB‑Specific Abstractions
To demonstrate the potential of privileged kernel bypass, we built Libdbos, a minimal guest kernel that exposes customizable exception handlers and memory allocation. Developers can override page‑fault handling, system‑call processing or interrupt routing to implement database‑specific optimizations.
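To make this concrete, here is a hedged sketch of what overriding page-fault handling might look like from the developer's side. The names dbos_enter and dbos_set_pgflt_handler are hypothetical stand-ins (stubbed so the sketch compiles), not Libdbos's actual API:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical Libdbos-style interface -- illustrative stand-ins,
     * not the real API. dbos_enter() would use a Dune-like hypervisor
     * to move the process into guest ring 0; we stub it out here. */
    typedef void (*pgflt_handler_t)(uintptr_t fault_addr, uint64_t err_code);
    static pgflt_handler_t current_handler;

    static int dbos_enter(void) { return 0; /* stub: enter guest kernel */ }
    static void dbos_set_pgflt_handler(pgflt_handler_t h) { current_handler = h; }

    /* A database-specific page-fault handler: the DBMS can consult its
     * own metadata, fetch the page, and write the page-table entry
     * directly, with no system call and no generic kernel policy. */
    static void db_pgflt(uintptr_t fault_addr, uint64_t err_code) {
        fprintf(stderr, "DB resolving fault at %p (err=%llu)\n",
                (void *)fault_addr, (unsigned long long)err_code);
        /* ... buffer-manager lookup, I/O, and PTE install go here ... */
    }

    int main(void) {
        if (dbos_enter() != 0) return 1;   /* still a normal Linux process */
        dbos_set_pgflt_handler(db_pgflt);  /* #PF now routes to the DBMS */
        if (current_handler)               /* simulate one fault for demo */
            current_handler(0x7f00dead0000UL, 0);
        return 0;
    }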
Tabby: The first mechanism, Tabby, is a buffer pool that leverages virtual memory hardware directly and eliminates TLB shootdowns during page eviction. Tabby lazily detects and repairs stale page-table entries, eliminating TLB shootdowns entirely; it achieves state-of-the-art performance for both in-memory and out-of-memory workloads.
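Since Tabby's exact algorithm is beyond the scope of this page, the following is only a conceptual sketch of how lazy stale-entry detection can work in principle; all names (frame_meta, frame_of, pte_clear_local) are hypothetical and the privileged steps are stubbed:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NFRAMES    1024
    #define FRAME_FREE UINT64_MAX

    typedef struct {
        _Atomic uint64_t version;  /* bumped when the frame changes hands */
        uint64_t page_id;          /* DB page currently held, or FRAME_FREE */
    } frame_meta;

    static frame_meta frames[NFRAMES];

    /* Stand-ins for privileged guest-kernel operations: the real system
     * would edit the page table and run invlpg on this core only. */
    static uint64_t frame_of(uintptr_t va) { return (va >> 12) % NFRAMES; }
    static void pte_clear_local(uintptr_t va) { (void)va; /* local invlpg */ }

    /* Evict WITHOUT a TLB shootdown: no IPIs are sent; remote cores may
     * briefly keep a stale translation and repair it themselves. */
    static void evict(uintptr_t va) {
        uint64_t f = frame_of(va);
        frames[f].page_id = FRAME_FREE;
        atomic_fetch_add(&frames[f].version, 1);
        pte_clear_local(va);
    }

    /* Optimistic read: access the page, then re-check the version tag.
     * A stale TLB entry surfaces as a tag mismatch, so the reader
     * flushes its own translation and retries, never interrupted. */
    static bool read_page(uintptr_t va, uint64_t page_id) {
        for (;;) {
            uint64_t f  = frame_of(va);
            uint64_t v1 = atomic_load(&frames[f].version);
            if (frames[f].page_id != page_id) return false; /* evicted */
            /* ... copy or inspect the page contents here ... */
            if (atomic_load(&frames[f].version) == v1) return true;
            pte_clear_local(va); /* lazily fix the stale entry, retry */
        }
    }

    int main(void) {
        uintptr_t va = 42UL << 12;
        frames[frame_of(va)].page_id = 7;
        printf("before eviction: %s\n", read_page(va, 7) ? "hit" : "miss");
        evict(va);
        printf("after eviction:  %s\n", read_page(va, 7) ? "hit" : "miss");
        return 0;
    }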
Snappy: Our second mechanism, Snappy, is a high-frequency snapshotting subsystem that avoids expensive copy-on-write reference counting and can serve snapshots instantaneously. Snappy uses an epoch-based page-table snapshot and a simplified reclamation scheme; when integrated with Redis, it reduces tail latency during checkpointing by orders of magnitude compared with Linux fork.
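The following is a conceptual sketch of the epoch idea under loud assumptions: the structures and helpers are invented for illustration, the privileged steps are elided as comments, and Snappy's real design is more involved.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define ENTRIES 512            /* entries in one x86-64 page-table node */
    typedef uint64_t pte_t;

    static uint64_t global_epoch;  /* advanced on every snapshot */

    typedef struct {
        pte_t   *root;             /* private copy of the root table page */
        uint64_t epoch;            /* epoch at which the snapshot was cut */
    } snapshot_t;

    /* Taking a snapshot copies ONE 4 KiB page (the root table) and
     * write-protects the live tree, so it is near-instantaneous and
     * needs no per-page copy-on-write reference counts. */
    static snapshot_t take_snapshot(pte_t *live_root) {
        snapshot_t s;
        s.root = malloc(ENTRIES * sizeof(pte_t));
        memcpy(s.root, live_root, ENTRIES * sizeof(pte_t));
        /* privileged step (elided): clear the writable bit in live PTEs
         * so the first write to each page traps into the guest kernel. */
        s.epoch = ++global_epoch;
        return s;
    }

    /* On a write fault: give the live tree a fresh copy of the page and
     * leave the old frame to the snapshot. The old frame is tagged with
     * the current epoch and reclaimed once all older snapshots are
     * released -- one epoch comparison instead of per-page refcounts. */
    static void handle_write_fault(uintptr_t fault_addr) {
        /* new = alloc_frame(); memcpy(new, old, 4096);
         * remap(live_root, fault_addr, new, WRITABLE);
         * retire_frame(old, global_epoch);  // freed when epoch drains */
        printf("COW fault at %p resolved in the guest kernel\n",
               (void *)fault_addr);
    }

    int main(void) {
        pte_t live_root[ENTRIES] = {0};
        snapshot_t snap = take_snapshot(live_root);
        handle_write_fault((uintptr_t)0x7f0000001000UL);
        printf("snapshot cut at epoch %llu\n",
               (unsigned long long)snap.epoch);
        free(snap.root);
        return 0;
    }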
These prototypes illustrate how a DB running as a privileged process can exploit hardware primitives—virtual memory, interrupt handling and IOMMUs—to design abstractions that are impossible in user space on Linux.
Beyond Snappy and Tabby: A Virtualized DB‑OS Vision
Privileged kernel bypass is a vision, not just a single system. Virtualization turns the database into an operating-system co-designer: the DB can override page-fault handlers, schedule tasks across cores, control I/O submission queues and fine-tune isolation, all while delegating complex device drivers to Linux. For example, custom schedulers could use hardware preemption and cross-core interrupts to prioritize latency-critical queries and transactions over latency-insensitive ones (see the sketch below), and databases could use virtual memory hardware to build hardware-accelerated data structures.
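As one illustration of that vision, here is a hedged sketch of cross-core preemption; every name is hypothetical and the privileged primitives are stubbed, since no such scheduler exists yet:

    #include <stdio.h>

    /* Hypothetical guest-kernel primitives -- illustrative names only. */
    static void send_preempt_ipi(int core) { (void)core; /* cross-core interrupt */ }
    static int core_runs_batch_query(int core) { return core == 1; /* stub */ }

    /* Sketch of a DB-controlled scheduler: when a latency-critical
     * transaction arrives, preempt a core running a batch query via an
     * IPI instead of waiting for a voluntary yield point. The IPI
     * handler on the victim core would save the query's context and
     * switch to the urgent task -- something user space cannot do. */
    static void schedule_urgent_txn(int ncores) {
        for (int core = 0; core < ncores; core++) {
            if (core_runs_batch_query(core)) {
                send_preempt_ipi(core);
                printf("preempting batch query on core %d\n", core);
                return;
            }
        }
    }

    int main(void) { schedule_urgent_txn(4); return 0; }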
Importantly, privileged kernel bypass does not abandon the advances made by the Linux kernel community. Modern I/O interfaces like io_uring have demonstrated that the Linux storage stack can deliver tens of millions of IOPS on NVMe SSD arrays. Because our database still runs as a Linux process, it can continue to issue system calls to these high-performance interfaces and even integrate user-space libraries such as DPDK and SPDK for networking and storage. The difference is that we specialize only the subsystems that are difficult to optimize in the host kernel (virtual memory management, snapshotting and buffer pools) while delegating I/O and device management to the host OS. This selective co-design preserves compatibility with the broader Linux ecosystem and lets us leverage its ongoing innovations while achieving the performance and isolation benefits of a virtualized DB-OS.
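For instance, the database can keep consuming the host storage stack unchanged through liburing; a minimal asynchronous read (db.data is a placeholder file name) looks like this:

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BUF_SIZE 4096

    int main(void) {
        struct io_uring ring;
        if (io_uring_queue_init(64, &ring, 0) < 0) { /* 64-entry queues */
            perror("io_uring_queue_init");
            return 1;
        }
        int fd = open("db.data", O_RDONLY);          /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(BUF_SIZE);
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BUF_SIZE, 0); /* read page 0 */
        io_uring_submit(&ring);                        /* one syscall */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);                /* await completion */
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        close(fd);
        free(buf);
        return 0;
    }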
Publications
People
Xinjing Zhou, Viktor Leis, Jinming Hu, Xiangyao Yu, Michael Stonebraker