The Physics of Storage: NVMe Queues & io_uring

Why `write()` is a lie. The physics of NVMe Submission Queues, the Doorbell Register, and why `io_uring` beats `libaio`.

Intermediate · 40 min read · Expert Version →

🎯 What You'll Learn

  • Deconstruct the NVMe Protocol (Queues & Doorbells)
  • Compare `libaio` vs `io_uring` (Syscall Overhead)
  • Evaluate File System Journals (`ext4` vs `xfs` log latency)
  • Implement Direct IO (`O_DIRECT`) for durability
  • Tune the IO Scheduler (`noop` / `none` for SSDs)

Introduction

In the old days of Spinning Rust (Hard Drives), the Kernel had to be smart. It had to reorder writes to minimize the physical movement of the disk head (Seek Time). Today, with NVMe SSDs, the “Disk Head” is a lie. The spec allows up to 65,535 parallel I/O queues, each up to 65,536 commands deep. But the Linux Kernel defaults are often stuck in 1990.

This lesson explores Modern Storage Physics: How to scream data onto Flash memory without the OS getting in the way.


The Physics: NVMe Queues

An NVMe drive is not addressed like a classic block device. It is a Message Passing Interface: the host and the SSD controller communicate through Ring Buffers that live in host RAM.

  1. Submission Queue (SQ): Host writes commands here.
  2. Doorbell: Host writes to a register on the SSD controller (Message: “Go work”).
  3. Completion Queue (CQ): SSD writes status here when done.
  4. Interrupt: SSD interrupts CPU to say “Check CQ”.

Physics Bottleneck: If you use extensive locking in the Kernel (Ext4 Journal), you force this parallel hardware to act serially.
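
Conceptually, the driver's side of that handshake looks like the sketch below. The struct is a simplified rendering of the spec's 64-byte Submission Queue Entry, and the function is illustrative only, not a real driver:

/* Illustrative sketch: a simplified Submission Queue Entry and the
   steps a driver takes to submit it. Not real driver code. */
#include <stdint.h>

struct nvme_sqe {                 /* 64 bytes in the real spec */
    uint8_t  opcode;              /* e.g. 0x01 = Write, 0x02 = Read */
    uint8_t  flags;
    uint16_t command_id;          /* echoed back in the Completion Queue entry */
    uint32_t nsid;                /* namespace ID */
    uint64_t reserved;
    uint64_t metadata_ptr;
    uint64_t prp1, prp2;          /* physical addresses of the data buffers */
    uint32_t cdw10_15[6];         /* command-specific: starting LBA, block count... */
};

static void nvme_submit(struct nvme_sqe *sq, uint16_t *tail, uint16_t depth,
                        volatile uint32_t *sq_doorbell, struct nvme_sqe cmd)
{
    sq[*tail] = cmd;                       /* 1. place the command in host RAM  */
    *tail = (uint16_t)((*tail + 1) % depth);
    /* a real driver issues a write memory barrier here */
    *sq_doorbell = *tail;                  /* 2. ring the doorbell (MMIO write) */
    /* 3. the SSD fetches the SQE via DMA, executes it, posts a CQ entry,
       4. and raises an MSI-X interrupt (or the host polls the CQ).            */
}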


The Old Way: Buffered IO vs Direct IO

Standard write() copies data into the Page Cache (RAM). Writeback to disk happens “eventually” (the kernel’s flusher threads; pdflush in older kernels). Risk: Power loss before writeback = Data Loss.

O_DIRECT: Bypasses the Page Cache. The write DMAs straight from your buffer to the device. Requirement: buffer address, length, and file offset must all be aligned to the logical block size (typically 4096 bytes).

// Opening a file for Direct IO (O_CREAT requires a mode argument)
int fd = open("/data/trading.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
// Buffer address, length, and offset MUST be aligned
void *buf;
posix_memalign(&buf, 4096, 4096);
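
Putting it together, here is a minimal, self-contained sketch of a durable Direct IO write (the path and 4096-byte block size are illustrative). Note that O_DIRECT skips the Page Cache, but file metadata still needs an explicit flush:

#define _GNU_SOURCE              /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/data/trading.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return 1;

    /* Buffer address, I/O length, and file offset must all be block-aligned. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    memset(buf, 'A', 4096);

    if (pwrite(fd, buf, 4096, 0) != 4096) return 1;

    /* O_DIRECT bypasses the Page Cache, but metadata (file size, allocation)
       is not durable until it is flushed explicitly. */
    fdatasync(fd);

    free(buf);
    close(fd);
    return 0;
}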

The New Way: io_uring

libaio was Linux’s first attempt at Async IO. It never delivered: it is only truly asynchronous with O_DIRECT, and submission can silently block (on metadata, on a full device queue). io_uring is the revolution.

Architecture: Shared Ring Buffers between Kernel and Userspace. Completions are reaped by reading the CQ ring directly, with no syscall. Submissions are batched into a single io_uring_enter() call, and with a kernel-side polling thread (SQPOLL) even that disappears. Best case: zero syscalls and zero context switches on the hot path.

// liburing: queue a write and submit the batch (setup and completion handling omitted)
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, fd, buffer, len, offset);
io_uring_submit(&ring); // one syscall submits every queued SQE; it does not wait for completion
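
For context, a complete minimal liburing program might look like the following sketch (the file path and buffer size are illustrative; link with -luring). It submits one write and reaps the completion from the shared CQ ring:

#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)      /* create the SQ/CQ rings */
        return 1;

    int fd = open("/tmp/uring_demo.log", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;

    char buf[4096];
    memset(buf, 'A', sizeof(buf));

    /* Fill a Submission Queue Entry and submit the whole batch in one call. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    /* Reap the completion directly from the shared Completion Queue ring. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("write returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}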

Performance:

  • Syscall write(): ~500k IOPS per core.
  • io_uring (Polled Mode): ~3 Million IOPS per core.
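
Polled mode is opt-in at ring setup. A hedged sketch (the queue depth and idle timeout are illustrative; SQPOLL may require elevated privileges on older kernels):

#include <liburing.h>
#include <string.h>

/* Request a kernel-side submission-polling thread (SQPOLL): the kernel
   watches the SQ ring, so submissions need no syscall at all.
   IORING_SETUP_IOPOLL additionally busy-polls completions, but requires
   files opened with O_DIRECT. */
static int setup_polled_ring(struct io_uring *ring)
{
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 2000;   /* ms of idle before the SQ thread sleeps */
    return io_uring_queue_init_params(256, ring, &params);
}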

IO Schedulers: Do Less

For spinning disks, schedulers like CFQ (today BFQ) or deadline reorder requests to minimize seeks. For NVMe, reordering costs CPU and helps nothing.

The Fix: Set scheduler to none.

# Check current scheduler
cat /sys/block/nvme0n1/queue/scheduler 
# [mq-deadline] none

# Set to none (Physics: Don't touch my data)
echo none > /sys/block/nvme0n1/queue/scheduler

Practice Exercises

Exercise 1: Direct IO Benchmarking (Beginner)

Task: Use fio to test Buffered vs Direct IO. Command: fio --name=test --ioengine=sync --rw=write --bs=4k --size=1G, run once with --direct=0 and once with --direct=1. Observation: Throughput drops with direct=1, but latency consistency improves.

Exercise 2: io_uring vs write() (Intermediate)

Task: Run fio with --ioengine=io_uring vs --ioengine=psync. Action: Watch CPU usage. io_uring should use significantly less System CPU.

Exercise 3: Journal Latency (Advanced)

Task: Compare Ext4 vs XFS. Theory: Ext4 journals metadata (and data too in data=journal mode); XFS journals only metadata and is designed for parallelism. Action: Mount with noatime to stop access-time writes.


Knowledge Check

  1. What is the Doorbell Register?
  2. Why does O_DIRECT require aligned memory?
  3. Why is the Page Cache dangerous for databases?
  4. How does io_uring eliminate syscalls?
  5. Which IO Scheduler is best for NVMe?
Answers
  1. A register on the NVMe controller that tells the drive new commands are in the Queue.
  2. DMA Constraints. The hardware copies directly from RAM; it cannot handle unaligned boundaries.
  3. Double Caching / Data Loss. Database expects durability; Page Cache lies about completion.
  4. Shared Memory Rings. Kernel and Userspace share the submission and completion queue structures; completions are reaped without a syscall, and SQPOLL removes the submission syscall too.
  5. None (or noop). Let the hardware handle the parallelism.

Summary

  • NVMe: Parallel hardware needs parallel software.
  • O_DIRECT: Bypassing the Kernel’s lies.
  • io_uring: The future of standard Linux IO.
  • Scheduler: Turn it off.


Pro Version: For production-grade implementation details, see the full research article: storage-io-linux-latency



Questions about this lesson? Working on related infrastructure?

Let's discuss