Infrastructure
I/O Schedulers: Why the Kernel Reorders Your Writes
Deep dive into I/O schedulers, Direct I/O, io_uring, and AWS EBS optimization. Block layer internals for predictable storage latency.
The kernel is reordering your writes. You asked for A then B. The disk received B then A.
I/O schedulers optimize for throughput by batching and reordering requests. For trading audit logs, this means your write queues behind background activity, adding 100µs or more to what should be a 10µs operation.
This post covers the Linux block layer, why defaults hurt latency, and how to get predictable storage performance.
The Problem {#the-problem}
Default storage settings prioritize throughput:
| Default Behavior | Why It Exists | Latency Impact |
|---|---|---|
| I/O schedulers | Batching/reordering for HDD seeks | 1-10ms queueing |
| Page cache buffering | Write coalescing | Unpredictable flush timing |
| Request merging | Fewer I/O operations | Delay while accumulating |
| EBS burst behavior | Cost optimization | Variable IOPS |
For CPU-related storage interactions, see CPU Deep Dive. For memory interactions (page cache), see Memory Deep Dive.
Background: Block Layer Internals {#background}
The Block I/O Path
When you call write(), the path is (block/blk-core.c):
write() syscall
↓
VFS layer (file operations)
↓
Page cache (unless O_DIRECT)
↓
Filesystem (ext4, xfs)
↓
Block layer (I/O scheduling)
↓
Device driver (nvme, sd)
↓
Hardware
The block layer’s job: Convert file-level operations to block-level operations, queue them efficiently, and submit to hardware.
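To watch requests flow through this path on a live system, blktrace streams block-layer events (needs root and the blktrace package; the device path is an example):
# Stream block-layer events for one device (Ctrl-C to stop)
sudo blktrace -d /dev/nvme0n1 -o - | blkparse -i -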
I/O Schedulers
I/O schedulers (block/mq-deadline.c, etc.) reorder requests to improve throughput. Historical context:
| Scheduler | Era | Design Goal |
|---|---|---|
| CFQ | HDD era | Fair bandwidth between processes |
| Deadline | HDD era | Bounded latency with reordering |
| BFQ | Modern | Proportional bandwidth |
| mq-deadline | Multi-queue | Deadline for NVMe |
| none | Modern | Pass-through (no scheduling) |
Why reordering helped HDDs: Seeks take 5-10ms. Reordering requests to minimize head movement saves time.
Why reordering hurts NVMe: NVMe has no seek penalty. Random I/O is as fast as sequential. Scheduling overhead is pure latency addition.
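A quick way to confirm this on your own hardware is to compare sequential and random reads with fio: on NVMe the two runs report similar latency, on an HDD the random run is far slower (a sketch; the --filename path is a placeholder):
fio --name=seq --filename=/data/fio.test --size=1G --rw=read --bs=4k --direct=1 --runtime=10 --time_based
fio --name=rand --filename=/data/fio.test --size=1G --rw=randread --bs=4k --direct=1 --runtime=10 --time_based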
Multi-Queue Block Layer
Modern kernels use the multi-queue block layer (blk-mq, block/blk-mq.c):
Per-CPU software queues
↓
Hardware dispatch queues
↓
NVMe submission queues
Each CPU has its own queue, reducing lock contention. But schedulers still operate between software and hardware queues.
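You can inspect the CPU-to-queue mapping under sysfs (example NVMe device; exact layout varies by kernel version):
# One directory per hardware queue context; cpu_list shows which CPUs feed it
ls /sys/block/nvme0n1/mq/
cat /sys/block/nvme0n1/mq/0/cpu_list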
Fix 1: I/O Scheduler Selection {#scheduler}
The Problem
Even on NVMe, some distributions default to mq-deadline:
cat /sys/block/nvme0n1/queue/scheduler
# [mq-deadline] kyber bfq none
Every scheduler adds overhead; even minimal scheduling adds microseconds.
The Fix
# Check current
cat /sys/block/nvme0n1/queue/scheduler
# Set to none (bypass scheduling)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
# Verify
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq
Persistent via udev:
# /etc/udev/rules.d/60-scheduler.rules
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
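To apply the rule without rebooting, reload and re-trigger udev:
sudo udevadm control --reload-rules
sudo udevadm trigger --subsystem-match=block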
Why It Works (Kernel Internals)
The none scheduler (block/blk-mq-sched.c) does minimal work:
// With none scheduler:
blk_mq_request_bypass_insert() // Direct to hardware queue
No reordering, no batching, minimal overhead.
Expected Improvement
Eliminates 1-10ms scheduler queueing on NVMe. For HDDs, mq-deadline may still be better.
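If the same host also has spinning disks, you can scope the rule by the rotational flag so only they keep mq-deadline (a sketch, extending the udev rule above):
# /etc/udev/rules.d/60-scheduler.rules (spinning disks only)
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"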
Fix 2: Direct I/O {#direct-io}
The Problem
Standard writes go through the page cache:
write() → page cache → (later) disk
“Later” is unpredictable. The writeback threads (historically pdflush, now per-device flusher threads) decide when to flush based on:
- dirty_ratio thresholds
- dirty_expire_centisecs age
- Memory pressure
For audit logs: you call write() and return to trading. 500ms later writeback kicks in, and your next write() stalls behind the flush.
The Kernel Mechanism
O_DIRECT (fs/direct-io.c) bypasses the page cache:
write() with O_DIRECT → disk immediately
Requirements:
- Buffer must be aligned (typically 512 bytes or 4KB)
- Size must be multiple of block size
- No write coalescing benefit
The Fix
C:
#define _GNU_SOURCE           // required for O_DIRECT
#include <fcntl.h>
#include <stdlib.h>           // posix_memalign
#include <unistd.h>
int fd = open("/data/audit.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
// O_DIRECT requires an aligned buffer whose size is a multiple of the block size
void* buf;
posix_memalign(&buf, 4096, 4096); // 4KB aligned, 4KB long
// Write goes straight to the device, bypassing the page cache
ssize_t written = write(fd, buf, 4096);
Python:
import mmap
import os
# Open with O_DIRECT (Linux only)
fd = os.open('/data/audit.log', os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
# Anonymous mmap returns a page-aligned (4KB) buffer
buf = mmap.mmap(-1, 4096)
data = b'audit record'
buf[:len(data)] = data
# Pass the mmap object itself so the aligned memory is used
# (slicing would copy into an unaligned bytes object)
os.write(fd, buf)
os.close(fd)
Trade-offs
- No write coalescing: Multiple small writes = multiple I/O operations
- Alignment requirements: Adds complexity
- No read-ahead: Sequential reads won’t benefit from prefetching
Expected Improvement
Predictable I/O latency (10-50µs on NVMe) instead of variable background flush timing.
Fix 3: io_uring {#io-uring}
The Problem
Traditional syscalls (read, write) involve context switches:
User space → Kernel → User space
Each transition costs 0.5-2µs. For high-frequency I/O, this adds up.
The Kernel Mechanism
io_uring (io_uring/) uses shared memory rings:
Submission queue (user writes here)
↓
Kernel processes asynchronously
↓
Completion queue (user reads here)
A single io_uring_enter() call submits a whole batch of requests; with IORING_SETUP_SQPOLL, the kernel polls the submission ring and even that syscall disappears.
The Fix
Using liburing (C):
#include <liburing.h>
struct io_uring ring;
io_uring_queue_init(32, &ring, 0);
// Prepare write
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, fd, buf, len, offset);
sqe->user_data = 42;  // tag to match this request at completion
// Submit (batch multiple)
io_uring_submit(&ring);
// Wait for completion
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
// Process result
int result = cqe->res;
io_uring_cqe_seen(&ring, cqe);
Python (bindings vary; the call names below are illustrative pseudocode, not a specific package's API):
import io_uring  # hypothetical binding; real packages mirror the C API more closely
ring = io_uring.Ring(32)
ring.prep_write(fd, buf, length, offset)
ring.submit()
cqe = ring.wait()
Expected Improvement
Saves 0.5-2µs per I/O operation from eliminated syscall overhead. At 100K IOPS, this is 50-200ms/second saved.
Citation: io_uring performance documented by Jens Axboe.
Fix 4: Dirty Page Tuning {#dirty-pages}
The Problem
Default dirty page thresholds allow large amounts of buffered data:
sysctl vm.dirty_ratio
# vm.dirty_ratio = 20 (20% of RAM can be dirty)
sysctl vm.dirty_background_ratio
# vm.dirty_background_ratio = 10
With 64GB RAM, 20% is roughly 12.8GB of dirty data before forced writeback. When writeback finally happens, it’s a storm.
The Fix
# Smaller dirty buffers = more frequent, smaller flushes
sudo sysctl -w vm.dirty_ratio=5
sudo sysctl -w vm.dirty_background_ratio=2
# Faster writeback age
sudo sysctl -w vm.dirty_expire_centisecs=100 # 1 second
sudo sysctl -w vm.dirty_writeback_centisecs=100
Make persistent:
# /etc/sysctl.d/60-latency.conf
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
vm.dirty_expire_centisecs = 100
vm.dirty_writeback_centisecs = 100
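Apply the file immediately instead of waiting for the next boot:
# Load this file now, or use `sudo sysctl --system` to reload everything under /etc/sysctl.d
sudo sysctl -p /etc/sysctl.d/60-latency.conf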
Why It Works
Smaller buffers mean:
- Background writeback starts earlier
- Each flush is smaller
- No sudden I/O storms blocking allocations
Expected Improvement
Reduces I/O stall variance from write storms.
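To verify, watch how much dirty data is outstanding before and after the change:
# Dirty: data waiting for writeback; Writeback: data being flushed right now
grep -E '^(Dirty|Writeback):' /proc/meminfo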
Fix 5: AWS EBS Optimization {#ebs}
The Problem
EBS volume performance varies by type:
| Type | Baseline IOPS | Max IOPS | Latency |
|---|---|---|---|
| gp3 | 3,000 | 16,000 | 1-4ms |
| io2 | Provisioned | 64,000 | <1ms |
| io2 Block Express | Provisioned | 256,000 | <1ms |
| Instance store (NVMe) | ~400,000 | ~400,000 | <100µs |
Burst behavior: gp2 volumes earn and spend burst credits; when credits run out, latency spikes. gp3 has no burst credits, but it is capped at its provisioned IOPS (3,000 unless you pay for more).
The Fix
For latency-critical workloads:
# Terraform: io2 for predictable IOPS
resource "aws_ebs_volume" "trading" {
  availability_zone = "us-east-1a"
  size              = 100
  type              = "io2"
  iops              = 16000 # Provisioned, no burst credits
}
For lowest latency (ephemeral data):
# Instance types with NVMe instance store
resource "aws_instance" "trading" {
  instance_type = "i3.xlarge" # Includes NVMe SSD
  # WARNING: Instance store is ephemeral!
  # Use for cache, not persistent data
}
EBS-Optimized Instances
resource "aws_instance" "trading" {
  instance_type = "c6in.xlarge"
  ebs_optimized = true # Dedicated EBS bandwidth
}
EBS-optimized ensures storage traffic doesn’t compete with network traffic.
Verification
# Monitor IOPS and latency
iostat -x 1
# Check for burst credit depletion (CloudWatch)
# BurstBalance metric (gp2 volumes) shows remaining credits
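If you are still on gp2, BurstBalance can also be pulled from the CLI (a sketch; $VOLUME_ID is a placeholder):
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=$VOLUME_ID \
  --start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
  --end-time "$(date -u +%FT%TZ)" \
  --period 300 --statistics Minimum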
Expected Improvement
io2 vs gp2/gp3: provisioned IOPS eliminate credit- and cap-related latency variance. Instance store: roughly 10x faster than EBS.
Citation: AWS EBS documentation.
Design Philosophy {#design-philosophy}
The Golden Rule
Hot path never touches disk.
Storage I/O is 10-100µs minimum (NVMe), 1-10ms typical (EBS). No amount of tuning makes it competitive with memory (100ns).
Architecture implications:
| Operation | Where It Belongs |
|---|---|
| Market data processing | Memory only |
| Order decision | Memory only |
| Audit logging | Async queue, separate thread |
| State persistence | Write-ahead log, batched |
| Recovery | Startup, not hot path |
When Defaults Are Right
Storage optimizations matter for:
- Audit logs: Compliance requires writes
- State persistence: Crash recovery
- Market data replay: Historical analysis
They don’t matter for:
- Hot path: If you’re reading/writing disk here, redesign
The Tradeoff
| Change | We Give Up | We Get |
|---|---|---|
| none scheduler | Fairness between processes | Immediate dispatch |
| O_DIRECT | Write coalescing | Predictable timing |
| io_uring | Simpler code | Lower syscall overhead |
| Lower dirty_ratio | Large batch efficiency | No write storms |
| io2 EBS | Cost savings | Predictable IOPS |
Audit Your Infrastructure
Want to check if your servers are configured for low latency? Run latency-audit - it checks I/O schedulers, filesystem settings, and 30+ other configurations in seconds.
pip install latency-audit && latency-audit
Up Next in Linux Infrastructure Deep Dives
Trading Infrastructure: First Principles That Scale
Architecture decisions that determine your latency ceiling. AWS, Kubernetes, monitoring, and security patterns for crypto trading systems.
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |