Infrastructure

I/O Schedulers: Why the Kernel Reorders Your Writes

Deep dive into I/O schedulers, Direct I/O, io_uring, and AWS EBS optimization. Block layer internals for predictable storage latency.

8 min
#storage #latency #nvme #aws #ebs #linux #infrastructure

The kernel is reordering your writes. You asked for A then B. The disk received B then A.

I/O schedulers optimize for throughput by batching and reordering requests. For trading audit logs, that means your write queues behind background activity, adding 100µs or more to what should be a 10µs operation.

This post covers the Linux block layer, why defaults hurt latency, and how to get predictable storage performance.

The Problem {#the-problem}

Default storage settings prioritize throughput:

| Default Behavior | Why It Exists | Latency Impact |
| --- | --- | --- |
| I/O schedulers | Batching/reordering for HDD seeks | 1-10ms queueing |
| Page cache buffering | Write coalescing | Unpredictable flush timing |
| Request merging | Fewer I/O operations | Delay while accumulating |
| EBS burst behavior | Cost optimization | Variable IOPS |
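
Before touching anything, it is worth checking how a host is actually configured. A minimal check sketch in Python, assuming the standard Linux sysfs/procfs paths and an example device name of nvme0n1 (adjust for your hardware):

from pathlib import Path

def read(path):
    # Return a sysfs/procfs value, or "n/a" if the file doesn't exist
    try:
        return Path(path).read_text().strip()
    except OSError:
        return "n/a"

device = "nvme0n1"  # example device; adjust to your block device
print("scheduler:                ", read(f"/sys/block/{device}/queue/scheduler"))
print("vm.dirty_ratio:           ", read("/proc/sys/vm/dirty_ratio"))
print("vm.dirty_background_ratio:", read("/proc/sys/vm/dirty_background_ratio"))
print("vm.dirty_expire_centisecs:", read("/proc/sys/vm/dirty_expire_centisecs"))

Each of these defaults is addressed by one of the fixes below.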

For CPU-related storage interactions, see CPU Deep Dive. For memory interactions (page cache), see Memory Deep Dive.

Background: Block Layer Internals {#background}

The Block I/O Path

When you call write(), the path is (block/blk-core.c):

write() syscall → VFS layer (file operations) → Page cache (unless O_DIRECT) → Filesystem (ext4, xfs) → Block layer (I/O scheduling) → Device driver (nvme, sd) → Hardware

The block layer’s job: Convert file-level operations to block-level operations, queue them efficiently, and submit to hardware.

I/O Schedulers

I/O schedulers (block/mq-deadline.c, etc.) reorder requests to improve throughput. Historical context:

| Scheduler | Era | Design Goal |
| --- | --- | --- |
| CFQ | HDD era | Fair bandwidth between processes |
| Deadline | HDD era | Bounded latency with reordering |
| BFQ | Modern | Proportional bandwidth |
| mq-deadline | Multi-queue | Deadline for NVMe |
| none | Modern | Pass-through (no scheduling) |

Why reordering helped HDDs: Seeks take 5-10ms. Reordering requests to minimize head movement saves time.

Why reordering hurts NVMe: NVMe has no seek penalty. Random I/O is as fast as sequential. Scheduling overhead is pure latency addition.
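
You can verify this on your own hardware. A rough sketch in Python: it times sequential versus random 4KB reads through O_DIRECT so the page cache doesn't mask device behavior. The file path is an example and the test file must already exist (a few hundred MB is plenty); on NVMe the two numbers come out close, on an HDD they don't.

import mmap, os, random, time

PATH = "/data/testfile"   # example: pre-created test file, a few hundred MB
BLOCK = 4096
COUNT = 10_000

fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
size = os.fstat(fd).st_size
buf = mmap.mmap(-1, BLOCK)    # page-aligned buffer, required by O_DIRECT

def avg_us(offsets):
    # Issue one 4KB read per offset and return the mean latency in µs
    start = time.perf_counter()
    for off in offsets:
        os.preadv(fd, [buf], off)
    return (time.perf_counter() - start) / len(offsets) * 1e6

sequential = [i * BLOCK for i in range(COUNT)]
scattered = [random.randrange(size // BLOCK) * BLOCK for _ in range(COUNT)]

print(f"sequential: {avg_us(sequential):.1f} µs/read")
print(f"random:     {avg_us(scattered):.1f} µs/read")
os.close(fd)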

Multi-Queue Block Layer

Modern kernels use the multi-queue block layer (blk-mq, block/blk-mq.c):

Per-CPU software queues → Hardware dispatch queues → NVMe submission queues
Each CPU has its own queue, reducing lock contention. But schedulers still operate between software and hardware queues.
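
To see the mapping on a given machine, blk-mq exposes its hardware dispatch queues and their CPU assignments under sysfs. A small sketch in Python (the device name is an example):

from pathlib import Path

device = "nvme0n1"  # example; adjust to your block device
for hwq in sorted(Path(f"/sys/block/{device}/mq").iterdir(), key=lambda p: int(p.name)):
    cpus = (hwq / "cpu_list").read_text().strip()
    print(f"hardware queue {hwq.name}: CPUs {cpus}")

On a typical NVMe device you'll see one hardware queue per CPU, or per group of CPUs if the controller exposes fewer queues than there are cores.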

Fix 1: I/O Scheduler Selection {#scheduler}

The Problem

Even on NVMe, some distributions default to mq-deadline:

cat /sys/block/nvme0n1/queue/scheduler
# [mq-deadline] kyber bfq none

Every scheduler adds overhead; even minimal scheduling adds microseconds.

The Fix

# Check current
cat /sys/block/nvme0n1/queue/scheduler

# Set to none (bypass scheduling)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

# Verify
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq

Persistent via udev:

# /etc/udev/rules.d/60-scheduler.rules
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"

Why It Works (Kernel Internals)

The none scheduler (block/blk-mq-sched.c) does minimal work:

// With none scheduler:
blk_mq_request_bypass_insert()  // Direct to hardware queue

No reordering, no batching, minimal overhead.

Expected Improvement

Eliminates 1-10ms scheduler queueing on NVMe. For HDDs, mq-deadline may still be better.

Fix 2: Direct I/O {#direct-io}

The Problem

Standard writes go through the page cache:

write() → page cache → (later) disk

“Later” is unpredictable. The kernel's background writeback threads (the modern replacement for pdflush) decide when to flush based on:

  • dirty_ratio thresholds
  • dirty_expire_centisecs age
  • Memory pressure

For audit logs: You call write(), return to trading. 500ms later, background writeback stalls your trading thread while flushing.

The Kernel Mechanism

O_DIRECT (fs/direct-io.c) bypasses the page cache:

write() with O_DIRECT → disk immediately

Requirements:

  • Buffer must be aligned (typically 512 bytes or 4KB)
  • Size and file offset must be multiples of the block size
  • No write coalescing benefit

The Fix

C:

#define _GNU_SOURCE                // needed for O_DIRECT
#include <fcntl.h>
#include <stdlib.h>                // posix_memalign
#include <unistd.h>

int fd = open("/data/audit.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);

// Buffer must be aligned, and the write size a multiple of the block size
void* buf;
posix_memalign(&buf, 4096, 4096);  // 4KB-aligned, 4KB long

// Write directly to disk, bypassing the page cache
ssize_t written = write(fd, buf, 4096);

Python:

import os
import mmap

data = b"audit record\n"  # example payload

# Open with O_DIRECT (Linux-only flag)
fd = os.open('/data/audit.log', os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)

# Page-aligned buffer via an anonymous mmap (O_DIRECT requires alignment)
buf = mmap.mmap(-1, 4096)
buf[:len(data)] = data

# Pass the mmap itself so the kernel sees the aligned buffer
# (slicing would copy the data into an unaligned bytes object)
os.write(fd, buf)

Trade-offs

  • No write coalescing: Multiple small writes = multiple I/O operations
  • Alignment requirements: Adds complexity
  • No read-ahead: Sequential reads won’t benefit from prefetching

Expected Improvement

Predictable I/O latency (10-50µs on NVMe) instead of variable background flush timing.
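
To check those numbers on your own storage, a measurement sketch in Python: it appends 4KB O_DIRECT records and reports latency percentiles. The log path is an example; run it against the volume you care about.

import mmap, os, time

PATH = "/data/audit_direct.log"   # example path on the target volume
BLOCK = 4096
N = 5_000

fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
buf = mmap.mmap(-1, BLOCK)        # page-aligned buffer for O_DIRECT
buf[:13] = b"audit-record\n"

samples = []
for i in range(N):
    start = time.perf_counter()
    os.pwrite(fd, buf, i * BLOCK)  # aligned length and offset
    samples.append((time.perf_counter() - start) * 1e6)
os.close(fd)

samples.sort()
print(f"p50 {samples[N // 2]:.1f} µs  "
      f"p99 {samples[int(N * 0.99)]:.1f} µs  "
      f"max {samples[-1]:.1f} µs")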

Fix 3: io_uring {#io-uring}

The Problem

Traditional syscalls (read, write) involve context switches:

User space → Kernel → User space

Each transition costs 0.5-2µs. For high-frequency I/O, this adds up.

The Kernel Mechanism

io_uring (io_uring/) uses shared memory rings:

Submission queue (user writes here) → Kernel processes asynchronously → Completion queue (user reads here)

With IORING_SETUP_SQPOLL, no syscall is needed for submission at all: a kernel thread polls the submission ring. Even in the default mode, a single io_uring_enter() call can submit and reap many operations at once.

The Fix

Using liburing (C):

#include <liburing.h>

struct io_uring ring;
io_uring_queue_init(32, &ring, 0);

// Prepare a write (fd, buf, len, offset: an already-open file and I/O buffer)
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, fd, buf, len, offset);
io_uring_sqe_set_data(sqe, my_context);  // application pointer, returned in the CQE

// Submit (batch multiple)
io_uring_submit(&ring);

// Wait for completion
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);

// Process result
int result = cqe->res;
io_uring_cqe_seen(&ring, cqe);

Python (bindings vary; the API below is illustrative):

import io_uring  # hypothetical binding; real packages differ in naming and API

ring = io_uring.Ring(32)
ring.prep_write(fd, buffer, len, offset)
ring.submit()
cqe = ring.wait()

Expected Improvement

Saves 0.5-2µs per I/O operation from eliminated syscall overhead. At 100K IOPS, this is 50-200ms/second saved.

Citation: io_uring performance documented by Jens Axboe.

Fix 4: Dirty Page Tuning {#dirty-pages}

The Problem

Default dirty page thresholds allow large amounts of buffered data:

sysctl vm.dirty_ratio
# vm.dirty_ratio = 20 (20% of RAM can be dirty)

sysctl vm.dirty_background_ratio
# vm.dirty_background_ratio = 10

With 64GB RAM, 20% is roughly 13GB of dirty data before forced writeback. When writeback finally happens, it’s a storm.

The Fix

# Smaller dirty buffers = more frequent, smaller flushes
sudo sysctl -w vm.dirty_ratio=5
sudo sysctl -w vm.dirty_background_ratio=2

# Faster writeback age
sudo sysctl -w vm.dirty_expire_centisecs=100  # 1 second
sudo sysctl -w vm.dirty_writeback_centisecs=100

Make persistent:

# /etc/sysctl.d/60-latency.conf
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
vm.dirty_expire_centisecs = 100
vm.dirty_writeback_centisecs = 100

Why It Works

Smaller buffers mean:

  • Background writeback starts earlier
  • Each flush is smaller
  • No sudden I/O storms blocking allocations

Expected Improvement

Reduces I/O stall variance from write storms.
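
To watch these settings in action, the kernel reports dirty and under-writeback page cache in /proc/meminfo. A small monitoring sketch in Python:

import time

def dirty_stats():
    # Returns the Dirty and Writeback counters from /proc/meminfo, in kB
    stats = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, rest = line.split(":", 1)
            if name in ("Dirty", "Writeback"):
                stats[name] = int(rest.split()[0])
    return stats

for _ in range(30):               # sample once a second for 30 seconds
    s = dirty_stats()
    print(f"Dirty: {s['Dirty']:>8} kB   Writeback: {s['Writeback']:>8} kB")
    time.sleep(1)

With the lower thresholds, the Dirty figure should stay small instead of climbing toward gigabytes and then collapsing in one large flush.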

Fix 5: AWS EBS Optimization {#ebs}

The Problem

EBS volume performance varies by type:

| Type | Baseline IOPS | Max IOPS | Latency |
| --- | --- | --- | --- |
| gp3 | 3,000 | 16,000 | 1-4ms |
| io2 | Provisioned | 64,000 | <1ms |
| io2 Block Express | Provisioned | 256,000 | <1ms |
| Instance store (NVMe) | ~400,000 | ~400,000 | <100µs |

Burst behavior: on gp2 (and the HDD-backed types), IOPS above baseline consume burst credits, and latency spikes when they run out. gp3 has no burst credits, but its baseline is capped at 3,000 IOPS unless you provision (and pay for) more.

The Fix

For latency-critical workloads:

# Terraform: io2 for predictable IOPS
resource "aws_ebs_volume" "trading" {
  availability_zone = "us-east-1a"
  size              = 100
  type              = "io2"
  iops              = 16000  # Provisioned, no burst credits
}

For lowest latency (ephemeral data):

# Instance types with NVMe instance store
resource "aws_instance" "trading" {
  instance_type = "i3.xlarge"  # Includes NVMe SSD
  
  # WARNING: Instance store is ephemeral!
  # Use for cache, not persistent data
}

EBS-Optimized Instances

resource "aws_instance" "trading" {
  instance_type = "c6in.xlarge"
  ebs_optimized = true  # Dedicated EBS bandwidth
}

EBS-optimized ensures storage traffic doesn’t compete with network traffic.

Verification

# Monitor IOPS and latency
iostat -x 1

# Check for burst credit depletion on gp2 volumes (CloudWatch)
# BurstBalance metric shows remaining credits
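
BurstBalance can also be pulled programmatically, which is useful for alerting before credits hit zero on gp2 volumes. A sketch using boto3; the volume ID and region are placeholders, and the call needs CloudWatch read permissions:

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region
now = datetime.datetime.now(datetime.timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Minimum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Minimum']:.1f}% burst credits remaining")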

Expected Improvement

io2 vs gp3: provisioned IOPS and consistently sub-millisecond latency instead of a 3,000 IOPS baseline. Instance store: roughly 10x lower latency than EBS.

Citation: AWS EBS documentation.

Design Philosophy {#design-philosophy}

The Golden Rule

Hot path never touches disk.

Storage I/O is 10-100µs minimum (NVMe), 1-10ms typical (EBS). No amount of tuning makes it competitive with memory (100ns).

Architecture implications:

| Operation | Where It Belongs |
| --- | --- |
| Market data processing | Memory only |
| Order decision | Memory only |
| Audit logging | Async queue, separate thread |
| State persistence | Write-ahead log, batched |
| Recovery | Startup, not hot path |
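
For the audit-logging row, the pattern is simple enough to sketch: the hot path only enqueues, and a dedicated writer thread owns the file descriptor and does all disk I/O. A simplified Python illustration (names and path are hypothetical, and a production version would combine this with O_DIRECT or io_uring from the fixes above):

import os
import queue
import threading

audit_queue = queue.Queue(maxsize=65536)   # bounded so the hot path never blocks

def hot_path_log(record: bytes) -> None:
    # Hot path: enqueue only, never touch the disk
    try:
        audit_queue.put_nowait(record)
    except queue.Full:
        pass  # count drops in real code rather than blocking

def writer_loop() -> None:
    # Only this thread performs storage I/O
    fd = os.open("/data/audit.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    while True:
        os.write(fd, audit_queue.get())

threading.Thread(target=writer_loop, daemon=True).start()
hot_path_log(b"ORDER 42 FILLED\n")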

When Defaults Are Right

Storage optimizations matter for:

  • Audit logs: Compliance requires writes
  • State persistence: Crash recovery
  • Market data replay: Historical analysis

They don’t matter for:

  • Hot path: If you’re reading/writing disk here, redesign

The Tradeoff

| Change | We Give Up | We Get |
| --- | --- | --- |
| none scheduler | Fairness between processes | Immediate dispatch |
| O_DIRECT | Write coalescing | Predictable timing |
| io_uring | Simpler code | Lower syscall overhead |
| Lower dirty_ratio | Large batch efficiency | No write storms |
| io2 EBS | Cost savings | Predictable IOPS |

Audit Your Infrastructure

Want to check if your servers are configured for low latency? Run latency-audit - it checks I/O schedulers, filesystem settings, and 30+ other configurations in seconds.

pip install latency-audit && latency-audit

Reading Path

Continue exploring with these related deep dives:

| Topic | Next Post |
| --- | --- |
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |