Infrastructure
I/O Schedulers: Why the Kernel Reorders Your Writes
Deep dive into I/O schedulers, Direct I/O, io_uring, and AWS EBS optimization. Block layer internals for predictable storage latency.
The kernel is reordering your writes. You asked for A then B. The disk received B then A.
I/O schedulers optimize for throughput by batching and reordering requests. For trading audit logs, this means your write queues behind background activity, adding 100µs or more to what should be a 10µs operation.
This post covers the Linux block layer, why defaults hurt latency, and how to get predictable storage performance.
The Problem {#the-problem}
Default storage settings prioritize throughput:
| Default Behavior | Why It Exists | Latency Impact |
|---|---|---|
| I/O schedulers | Batching/reordering for HDD seeks | 1-10ms queueing |
| Page cache buffering | Write coalescing | Unpredictable flush timing |
| Request merging | Fewer I/O operations | Delay while accumulating |
| EBS burst behavior | Cost optimization | Variable IOPS |
For CPU-related storage interactions, see CPU Deep Dive. For memory interactions (page cache), see Memory Deep Dive.
Background: Block Layer Internals {#background}
The Block I/O Path
When you call write(), the path is (block/blk-core.c):
write() syscall
↓
VFS layer (file operations)
↓
Page cache (unless O_DIRECT)
↓
Filesystem (ext4, xfs)
↓
Block layer (I/O scheduling)
↓
Device driver (nvme, sd)
↓
Hardware
The block layer’s job: Convert file-level operations to block-level operations, queue them efficiently, and submit to hardware.
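To watch requests flow through this path on a live system, blktrace streams block-layer events (needs root and the blktrace package; the device path is an example):
# Stream block-layer events for one device (Ctrl-C to stop)
sudo blktrace -d /dev/nvme0n1 -o - | blkparse -i -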
I/O Schedulers
I/O schedulers (block/mq-deadline.c, etc.) reorder requests to improve throughput. Historical context:
| Scheduler | Era | Design Goal |
|---|---|---|
| CFQ | HDD era | Fair bandwidth between processes |
| Deadline | HDD era | Bounded latency with reordering |
| BFQ | Modern | Proportional bandwidth |
| mq-deadline | Multi-queue | Deadline for NVMe |
| none | Modern | Pass-through (no scheduling) |
Why reordering helped HDDs: Seeks take 5-10ms. Reordering requests to minimize head movement saves time.
Why reordering hurts NVMe: NVMe has no seek penalty. Random I/O is as fast as sequential. Scheduling overhead is pure latency addition.
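A quick way to confirm this on your own hardware is to compare sequential and random reads with fio: on NVMe the two runs report similar latency, on an HDD the random run is far slower (a sketch; the --filename path is a placeholder):
fio --name=seq --filename=/data/fio.test --size=1G --rw=read --bs=4k --direct=1 --runtime=10 --time_based
fio --name=rand --filename=/data/fio.test --size=1G --rw=randread --bs=4k --direct=1 --runtime=10 --time_based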
Multi-Queue Block Layer
Modern kernels use the multi-queue block layer (blk-mq, block/blk-mq.c):
Per-CPU software queues
↓
Hardware dispatch queues
↓
NVMe submission queues
Each CPU has its own queue, reducing lock contention. But schedulers still operate between software and hardware queues.
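You can inspect the CPU-to-queue mapping under sysfs (example NVMe device; exact layout varies by kernel version):
# One directory per hardware queue context; cpu_list shows which CPUs feed it
ls /sys/block/nvme0n1/mq/
cat /sys/block/nvme0n1/mq/0/cpu_list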
Fix 1: I/O Scheduler Selection {#scheduler}
The Problem
Even on NVMe, some distributions default to mq-deadline:
cat /sys/block/nvme0n1/queue/scheduler
# [mq-deadline] kyber bfq none
Every scheduler adds overhead; even minimal scheduling adds microseconds.
The Fix
# Check current
cat /sys/block/nvme0n1/queue/scheduler
# Set to none (bypass scheduling)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
# Verify
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq
Persistent via udev:
# /etc/udev/rules.d/60-scheduler.rules
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
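To apply the rule without rebooting, reload and re-trigger udev:
sudo udevadm control --reload-rules
sudo udevadm trigger --subsystem-match=block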
Why It Works (Kernel Internals)
The none scheduler (block/blk-mq-sched.c) does minimal work:
// With none scheduler:
blk_mq_request_bypass_insert() // Direct to hardware queue
No reordering, no batching, minimal overhead.
Expected Improvement
Eliminates 1-10ms scheduler queueing on NVMe. For HDDs, mq-deadline may still be better.
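If the same host also has spinning disks, you can scope the rule by the rotational flag so only they keep mq-deadline (a sketch, extending the udev rule above):
# /etc/udev/rules.d/60-scheduler.rules (spinning disks only)
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"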
Fix 2: Direct I/O {#direct-io}
The Problem
Standard writes go through the page cache:
write() → page cache → (later) disk
“Later” is unpredictable. The writeback threads (historically pdflush, now per-device flusher threads) decide when to flush based on:
- dirty_ratio thresholds
- dirty_expire_centisecs age
- Memory pressure
For audit logs: you call write() and return to trading. 500ms later writeback kicks in, and your next write() stalls behind the flush.
The Kernel Mechanism
O_DIRECT (fs/direct-io.c) bypasses the page cache:
write() with O_DIRECT → disk immediately
Requirements:
- Buffer must be aligned (typically 512 bytes or 4KB)
- Size must be multiple of block size
- No write coalescing benefit
The Fix
C:
#define _GNU_SOURCE           // required for O_DIRECT
#include <fcntl.h>
#include <stdlib.h>           // posix_memalign
#include <unistd.h>
int fd = open("/data/audit.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
// O_DIRECT requires an aligned buffer whose size is a multiple of the block size
void* buf;
posix_memalign(&buf, 4096, 4096); // 4KB aligned, 4KB long
// Write goes straight to the device, bypassing the page cache
ssize_t written = write(fd, buf, 4096);
Python:
import mmap
import os
# Open with O_DIRECT (Linux only)
fd = os.open('/data/audit.log', os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
# Anonymous mmap returns a page-aligned (4KB) buffer
buf = mmap.mmap(-1, 4096)
data = b'audit record'
buf[:len(data)] = data
# Pass the mmap object itself so the aligned memory is used
# (slicing would copy into an unaligned bytes object)
os.write(fd, buf)
os.close(fd)
Trade-offs
- No write coalescing: Multiple small writes = multiple I/O operations
- Alignment requirements: Adds complexity
- No read-ahead: Sequential reads won’t benefit from prefetching
Expected Improvement
Predictable I/O latency (10-50µs on NVMe) instead of variable background flush timing.
Fix 3: io_uring {#io-uring}
The Problem
Traditional syscalls (read, write) involve context switches:
User space → Kernel → User space
Each transition costs 0.5-2µs. For high-frequency I/O, this adds up.
The Kernel Mechanism
io_uring (io_uring/) uses shared memory rings:
Submission queue (user writes here)
↓
Kernel processes asynchronously
↓
Completion queue (user reads here)
A single io_uring_enter() call submits a whole batch of requests; with IORING_SETUP_SQPOLL, the kernel polls the submission ring and even that syscall disappears.
The Fix
Using liburing (C):
#include <liburing.h>
struct io_uring ring;
io_uring_queue_init(32, &ring, 0);
// Prepare write
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, fd, buf, len, offset);
sqe->user_data = 42;  // tag to match this request at completion
// Submit (batch multiple)
io_uring_submit(&ring);
// Wait for completion
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
// Process result
int result = cqe->res;
io_uring_cqe_seen(&ring, cqe);
Python (bindings vary; the call names below are illustrative pseudocode, not a specific package's API):
import io_uring  # hypothetical binding; real packages mirror the C API more closely
ring = io_uring.Ring(32)
ring.prep_write(fd, buf, length, offset)
ring.submit()
cqe = ring.wait()
Expected Improvement
Saves 0.5-2µs per I/O operation from eliminated syscall overhead. At 100K IOPS, this is 50-200ms/second saved.
Citation: io_uring performance documented by Jens Axboe.
Fix 4: Dirty Page Tuning {#dirty-pages}
The Problem
Default dirty page thresholds allow large amounts of buffered data:
sysctl vm.dirty_ratio
# vm.dirty_ratio = 20 (20% of RAM can be dirty)
sysctl vm.dirty_background_ratio
# vm.dirty_background_ratio = 10
With 64GB RAM, 20% is roughly 12.8GB of dirty data before forced writeback. When writeback finally happens, it’s a storm.
The Fix
# Smaller dirty buffers = more frequent, smaller flushes
sudo sysctl -w vm.dirty_ratio=5
sudo sysctl -w vm.dirty_background_ratio=2
# Faster writeback age
sudo sysctl -w vm.dirty_expire_centisecs=100 # 1 second
sudo sysctl -w vm.dirty_writeback_centisecs=100
Make persistent:
# /etc/sysctl.d/60-latency.conf
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
vm.dirty_expire_centisecs = 100
vm.dirty_writeback_centisecs = 100
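Apply the file immediately instead of waiting for the next boot:
# Load this file now, or use `sudo sysctl --system` to reload everything under /etc/sysctl.d
sudo sysctl -p /etc/sysctl.d/60-latency.conf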
Why It Works
Smaller buffers mean:
- Background writeback starts earlier
- Each flush is smaller
- No sudden I/O storms blocking allocations
Expected Improvement
Reduces I/O stall variance from write storms.
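To verify, watch how much dirty data is outstanding before and after the change:
# Dirty: data waiting for writeback; Writeback: data being flushed right now
grep -E '^(Dirty|Writeback):' /proc/meminfo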
Fix 5: AWS EBS Optimization {#ebs}
The Problem
EBS volume performance varies by type:
| Type | Baseline IOPS | Max IOPS | Latency |
|---|---|---|---|
| gp3 | 3,000 | 16,000 | 1-4ms |
| io2 | Provisioned | 64,000 | <1ms |
| io2 Block Express | Provisioned | 256,000 | <1ms |
| Instance store (NVMe) | ~400,000 | ~400,000 | <100µs |
Burst behavior: gp2 volumes earn and spend burst credits; when credits run out, latency spikes. gp3 has no burst credits, but it is capped at its provisioned IOPS (3,000 unless you pay for more).
The Fix
For latency-critical workloads:
# Terraform: io2 for predictable IOPS
resource "aws_ebs_volume" "trading" {
  availability_zone = "us-east-1a"
  size              = 100
  type              = "io2"
  iops              = 16000 # Provisioned, no burst credits
}
For lowest latency (ephemeral data):
# Instance types with NVMe instance store
resource "aws_instance" "trading" {
  instance_type = "i3.xlarge" # Includes NVMe SSD
  # WARNING: Instance store is ephemeral!
  # Use for cache, not persistent data
}
EBS-Optimized Instances
resource "aws_instance" "trading" {
  instance_type = "c6in.xlarge"
  ebs_optimized = true # Dedicated EBS bandwidth
}
EBS-optimized ensures storage traffic doesn’t compete with network traffic.
Verification
# Monitor IOPS and latency
iostat -x 1
# Check for burst credit depletion (CloudWatch)
# BurstBalance metric (gp2 volumes) shows remaining credits
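If you are still on gp2, BurstBalance can also be pulled from the CLI (a sketch; $VOLUME_ID is a placeholder):
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=$VOLUME_ID \
  --start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Minimum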
Expected Improvement
io2 vs gp2/gp3: provisioned IOPS eliminate credit- and cap-related latency variance. Instance store: roughly 10x faster than EBS.
Citation: AWS EBS documentation.
Design Philosophy {#design-philosophy}
The Golden Rule
Hot path never touches disk.
Storage I/O is 10-100µs minimum (NVMe), 1-10ms typical (EBS). No amount of tuning makes it competitive with memory (100ns).
Architecture implications:
| Operation | Where It Belongs |
|---|---|
| Market data processing | Memory only |
| Order decision | Memory only |
| Audit logging | Async queue, separate thread |
| State persistence | Write-ahead log, batched |
| Recovery | Startup, not hot path |
When Defaults Are Right
Storage optimizations matter for:
- Audit logs: Compliance requires writes
- State persistence: Crash recovery
- Market data replay: Historical analysis
They don’t matter for:
- Hot path: If you’re reading/writing disk here, redesign
The Tradeoff
| Change | We Give Up | We Get |
|---|---|---|
| none scheduler | Fairness between processes | Immediate dispatch |
| O_DIRECT | Write coalescing | Predictable timing |
| io_uring | Simpler code | Lower syscall overhead |
| Lower dirty_ratio | Large batch efficiency | No write storms |
| io2 EBS | Cost savings | Predictable IOPS |
Audit Your Infrastructure
Want to check if your servers are configured for low latency? Run latency-audit - it checks I/O schedulers, filesystem settings, and 30+ other configurations in seconds.
pip install latency-audit && latency-audit
Up Next in Linux Infrastructure Deep Dives
Trading Infrastructure: First Principles That Scale
Architecture decisions that determine your latency ceiling. AWS, Kubernetes, monitoring, and security patterns for crypto trading systems.
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |