The $2M Millisecond: Linux Defaults That Cost You Money
Deep dive into vm.swappiness, THP compaction, and C-states. Kernel internals, measurements, and the design philosophy behind low-latency Linux tuning.
I’ve audited trading servers at HFT firms and crypto exchanges for over a decade. 80% fail the same 5 kernel configuration checks.
The default Linux kernel is optimized for throughput (maximizing total work completed), not latency (minimizing response time). This is intentional. The kernel developers designed for the common case: web servers, databases, and batch processing where throughput matters more than individual request latency.
Trading is not the common case. A 100µs delay on a $1M order costs real money. This post goes deep into the kernel internals that cause these delays, why they exist, and how to fix them.
The Problem {#the-problem}
Every Linux distribution ships with defaults designed for general-purpose workloads:
| Setting | Default Value | Latency Impact | Root Cause |
|---|---|---|---|
| vm.swappiness | 60 | 10-100µs per page fault | Anonymous page reclamation |
| THP | always | 10-50ms compaction stalls | khugepaged defragmentation |
| CPU governor | powersave | 10-50µs frequency ramp | DVFS transitions |
| C-states | enabled | 50-100µs wake latency | Voltage/clock restoration |
| NIC offloads | enabled | 5-50µs packet batching | GRO/LRO coalescing |
Why do these defaults exist? They save power and maximize throughput. The kernel developers made reasonable tradeoffs for 99% of workloads. Trading systems are in the other 1%.
For deep dives into each subsystem, see:
- CPU Optimization - Governors, C-states, NUMA
- Memory Tuning - THP, swappiness, huge pages
- Network Optimization - Offloads, IRQ affinity
- Storage I/O - Schedulers, Direct I/O
Background: How Linux Memory Management Works {#background}
Before diving into fixes, you need to understand how Linux manages memory. This context explains why the defaults hurt latency.
The Page Cache and Anonymous Pages
Linux divides memory into two categories:
- File-backed pages (page cache): Cached file contents. Can be evicted by dropping them (clean) or writing back (dirty).
- Anonymous pages: Heap, stack, and mmap’d memory without a file backing. Can only be evicted by swapping to disk.
The kernel maintains LRU (Least Recently Used) lists for both types. When memory pressure occurs, the kswapd daemon scans these lists looking for pages to reclaim (kernel source: mm/vmscan.c).
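You can see the current split between the two LRU populations directly in /proc/meminfo. A minimal sketch (`lru_split` is our helper name, not a standard tool):

```bash
# Show the current anonymous vs file-backed page split from
# /proc/meminfo; lru_split is our helper name.
lru_split() {
  awk '/^(Active|Inactive)\(anon\)/ { anon += $2 }
       /^(Active|Inactive)\(file\)/ { file += $2 }
       END { printf "anon: %d kB, file: %d kB\n", anon, file }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
  lru_split
fi
```

On a box with a large file-backed population, swappiness decides which side of this split gets reclaimed first.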
The Memory Pressure Response
When free memory drops below a threshold, the kernel:
- Background reclamation (kswapd): A kernel thread wakes and starts scanning LRU lists
- Direct reclamation: If kswapd can’t keep up, the allocating process itself must wait while memory is freed
- OOM killer: Last resort; kill processes to free memory
Direct reclamation is the latency killer. Your trading thread asks for memory, and instead of getting it immediately, it waits while the kernel frees memory. This can take 10-100µs for page cache eviction or 1-10ms for swap operations.
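Direct-reclaim events are visible in /proc/vmstat. A rough sketch for spotting them (counter names assume a 4.10+ kernel, where `allocstall` is split per zone; `count_allocstalls` is our helper name):

```bash
# Count direct-reclaim stalls over an interval. allocstall_* in
# /proc/vmstat increments each time an allocating process has to
# reclaim memory itself instead of getting it from the free lists.
count_allocstalls() {
  awk '/^allocstall/ { total += $2 } END { print total + 0 }' "${1:-/proc/vmstat}"
}

if [ -r /proc/vmstat ]; then
  before=$(count_allocstalls)
  sleep 1
  after=$(count_allocstalls)
  echo "direct-reclaim events in 1s: $((after - before))"
fi
```

Run it before and after a load test; any nonzero delta means an allocating thread paid the reclaim bill itself.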
For more on memory internals, see Memory Tuning Deep Dive.
Fix 1: Disable Aggressive Swapping {#swappiness}
The Problem
With vm.swappiness = 60 (default), the kernel treats file-backed and anonymous pages roughly equally when deciding what to evict. This means your application heap can be swapped to disk even when there’s plenty of page cache that could be evicted instead.
The kernel code: In mm/vmscan.c, the get_scan_count() function calculates how many pages to scan from each LRU list. The swappiness value directly influences this ratio (kernel source).
```c
// Simplified from mm/vmscan.c
anon_prio = swappiness;
file_prio = 200 - swappiness;
```
With swappiness=60, anonymous pages get scan priority 60 against 140 for file pages, so the kernel scans them at less than half the rate of file pages. With swappiness=0, anonymous pages are only scanned under extreme memory pressure.
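The split is easy to sanity-check. A toy calculator of the simplified priority formula above (the real `get_scan_count()` also scales these by recent reclaim cost and memory pressure):

```bash
# Toy calculator for the simplified anon/file priority split;
# scan_ratio is our helper name, not a kernel interface.
scan_ratio() {
  anon_prio=$1
  file_prio=$((200 - $1))
  echo "swappiness=$1: anon_prio=$anon_prio file_prio=$file_prio"
}

scan_ratio 60   # swappiness=60: anon_prio=60 file_prio=140
scan_ratio 0    # swappiness=0: anon_prio=0 file_prio=200
```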
The Fix
```bash
# Check current value
cat /proc/sys/vm/swappiness
# Output: 60 (default)

# Set to 0 for latency-critical systems
echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Verify
sysctl vm.swappiness
```
Ansible automation:
```yaml
- name: Set swappiness to 0
  sysctl:
    name: vm.swappiness
    value: '0'
    state: present
    reload: yes
```
Why It Works (Kernel Internals)
With swappiness=0:
- Anonymous pages stay in RAM unless the system is critically low on memory
- Page cache is evicted first (this is safe-file contents can be re-read)
- Your trading heap never gets swapped unless the alternative is OOM
Verification:
```bash
# Watch for swap activity (should be 0)
vmstat 1 | awk '{print $7, $8}'  # si (swap in), so (swap out)

# Check current swap usage
free -h
```
Expected Improvement
- Eliminates 10-100µs page fault stalls from swap reads (measured on NVMe)
- EBS/network storage: eliminates 1-5ms stalls
Citation: Page fault latency measured using eBPF tracing. See Brendan Gregg’s Memory Flame Graphs for methodology.
Fix 2: Disable Transparent Huge Pages {#thp}
The Problem
Transparent Huge Pages (THP) automatically promotes 4KB pages to 2MB pages, reducing TLB misses. Sounds good. The problem is how it does this.
The khugepaged kernel thread continuously scans memory looking for contiguous 4KB pages it can merge into 2MB pages. This requires:
- Memory compaction: Moving pages around to create contiguous regions
- Process stalling: Holding mmap_sem while promoting pages
The killer: Compaction can stall your process for 10-50 milliseconds. Not microseconds; milliseconds. During a THP compaction event, your trading thread is frozen.
The Kernel Internals
THP is managed by the khugepaged kernel thread (kernel source: mm/khugepaged.c). When enabled, it:
- Scans process address spaces every `khugepaged_scan_sleep_millisecs` (default: 10000ms)
- Attempts to collapse contiguous pages into huge pages
- May trigger memory compaction if huge pages aren’t available
The compaction problem: Memory compaction (mm/compaction.c) migrates pages between zones to create contiguous regions. This holds locks that can block memory allocation.
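A quick way to see whether compaction and THP activity are happening on a box is to snapshot the relevant /proc/vmstat counters. A sketch (`thp_counters` is our helper name):

```bash
# Snapshot the /proc/vmstat counters that reveal compaction and THP
# activity: compact_stall counts direct-compaction events, and
# thp_fault_fallback counts THP allocations that fell back to 4KB.
thp_counters() {
  awk '$1 ~ /^(compact_stall|thp_fault_alloc|thp_fault_fallback|thp_collapse_alloc)$/ { print $1, $2 }' "${1:-/proc/vmstat}"
}

if [ -r /proc/vmstat ]; then
  thp_counters
fi
```

A rising `compact_stall` on a latency-critical host is the smoking gun this section describes.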
The Fix
```bash
# Check current status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: [always] madvise never

# Disable THP entirely
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# Verify
grep -E 'AnonHugePages|HugePages_' /proc/meminfo
```
Make persistent (EC2 user_data):
```bash
#!/bin/bash
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```
For Kubernetes (DaemonSet):
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disable-thp
spec:
  selector:
    matchLabels:
      name: disable-thp
  template:
    metadata:
      labels:
        name: disable-thp
    spec:
      hostPID: true
      containers:
        - name: disable-thp
          image: busybox
          command: ['sh', '-c', 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && sleep infinity']
          securityContext:
            privileged: true
          volumeMounts:
            - name: sys
              mountPath: /sys
      volumes:
        - name: sys
          hostPath:
            path: /sys
```
Why It Works (Kernel Internals)
Disabling THP:
- Stops `khugepaged` from scanning your address space
- Prevents compaction from holding mmap_sem
- Eliminates the 10-50ms stall risk
Verification:
```bash
# Watch for compaction activity
watch -n1 'grep -E "compact_|thp_" /proc/vmstat'

# Profile with perf during suspected stalls
sudo perf record -g -a sleep 10
sudo perf report
```
Expected Improvement
Eliminates 10-50ms compaction stalls. This is often the single biggest P99 improvement for trading systems.
Citation: THP compaction delays measured in Facebook’s THP Study and documented in kernel documentation.
Trade-off: You lose automatic huge page benefits. For controlled huge page usage, see explicit huge pages in Memory Tuning.
Fix 3: Lock CPU Frequency {#governors}
The Problem
Modern CPUs use Dynamic Voltage and Frequency Scaling (DVFS) to save power. The CPU governor decides when to change frequency based on load.
Governors explained:
| Governor | Behavior | Latency Impact |
|---|---|---|
| powersave | Minimum frequency always | Maximum latency on first instruction |
| ondemand | Ramps up when busy | 10-50µs ramp time |
| performance | Maximum frequency always | No ramp latency |
The ondemand governor (common default) monitors CPU utilization and ramps frequency. The problem: the first instructions after idle run at low frequency.
The Kernel Internals
CPU frequency scaling is managed by the cpufreq subsystem (kernel source: drivers/cpufreq/). When using the ondemand governor:
- A timer fires every `sampling_rate` microseconds (default: 10000)
- The governor checks CPU utilization
- If above threshold, frequency increases
- If below, frequency decreases
The latency: Frequency changes require voltage changes. The hardware needs 10-50µs to stabilize at the new frequency.
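One consequence: any core not pinned at its maximum frequency is a latency liability. A small audit sketch over the standard cpufreq sysfs files (`check_freq` is our helper name; values are in kHz):

```bash
# Flag any core not running at its configured maximum frequency,
# using the standard cpufreq sysfs files (values in kHz).
check_freq() {
  cur=$(cat "$1/scaling_cur_freq")
  max=$(cat "$1/scaling_max_freq")
  if [ "$cur" -lt "$max" ]; then
    echo "below max: $cur < $max kHz"
  else
    echo "at max: $cur kHz"
  fi
}

for d in /sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
  [ -d "$d" ] || continue
  echo "$d: $(check_freq "$d")"
done
```

Note that `scaling_cur_freq` is itself sampled, so a core transitioning at the moment you read it can report either value.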
The Fix
```bash
# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Output: ondemand (or powersave)

# Set all cores to performance
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance | sudo tee "$cpu"
done

# Verify frequency is at maximum
watch -n1 'grep MHz /proc/cpuinfo | head -4'
```
Ansible automation:
```yaml
- name: Set CPU governor to performance
  shell: |
    for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > $cpu
    done
  become: yes
```
Why It Works (Kernel Internals)
The performance governor bypasses the sampling mechanism entirely. It pins the frequency at scaling_max_freq and leaves it there.
Verification:
```bash
# Confirm frequency is stable at max
turbostat --interval 1 --show Core,CPU,Bzy_MHz
```
Expected Improvement
Eliminates 10-50µs frequency ramp latency on first instructions after idle.
Citation: DVFS transition latencies documented in Intel Software Developer’s Manual, Vol. 3, Chapter 14.
Connection: For C-states (idle states), see the CPU Deep Dive. For NUMA effects on frequency, see CPU NUMA section.
Fix 4: Disable Deep C-States {#cstates}
The Problem
Even with the performance governor, idle CPUs enter C-states to save power:
| C-State | What Happens | Wake Latency |
|---|---|---|
| C0 | Active | 0 |
| C1 | Clock stopped | 1-5µs |
| C1E | Clock + voltage reduced | 5-10µs |
| C3 | L1/L2 cache cold | 30-50µs |
| C6 | Voltage cut, state saved to RAM | 50-100µs |
The problem: Your trading thread is idle for 1ms waiting for market data. The CPU enters C6. Market data arrives. The CPU takes 50-100µs to wake up.
The Kernel Internals
C-state management is handled by the intel_idle driver (kernel source: drivers/idle/intel_idle.c). When a CPU has no work:
- The scheduler calls `do_idle()`
- `do_idle()` selects a C-state based on expected idle time
- The CPU enters the selected state
- On interrupt, the CPU wakes and resumes
The selection algorithm: The cpuidle governor (menu or ladder) predicts idle time and picks the deepest state that can wake within the expected time. For unpredictable trading workloads, this prediction is often wrong.
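The exit latency the kernel believes for each state is exported in sysfs, which is useful for sanity-checking the wake-latency table above on your own hardware. A sketch (`list_cstates` is our helper name; latency is in microseconds):

```bash
# Print each C-state's name and advertised exit latency from the
# standard cpuidle sysfs layout (/sys/.../cpuidle/stateN/{name,latency}).
list_cstates() {
  for s in "${1:-/sys/devices/system/cpu/cpu0/cpuidle}"/state*; do
    [ -d "$s" ] || continue
    printf '%s: exit latency %sus\n' "$(cat "$s/name")" "$(cat "$s/latency")"
  done
}

list_cstates
```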
The Fix
Option 1: Kernel boot parameters (recommended)
```bash
# Add to GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate
# the grub config (e.g. update-grub) and reboot
processor.max_cstate=1 intel_idle.max_cstate=0
```
This limits to C1 only-clock stops but voltage stays on.
Option 2: Runtime (temporary)
```bash
# Disable each C-state beyond C1 (state0 = POLL, state1 = C1;
# exact numbering varies by CPU, so check state*/name first)
for state in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]/disable; do
  echo 1 | sudo tee "$state"
done
```
Why It Works (Kernel Internals)
With max_cstate=1:
- CPUs enter C1 when idle (1-5µs wake)
- Never enter C3/C6 (50-100µs wake)
- Power consumption increases, but latency is predictable
Verification:
```bash
# Check current C-state residency (turbostat column names: CPU%c1 etc.)
turbostat --interval 1 --show Core,CPU%c1,CPU%c3,CPU%c6
# C3 and C6 residency should be 0%
```
Expected Improvement
Reduces worst-case wake latency from 50-100µs to 1-5µs.
Citation: C-state latencies from Intel Power Management Reference.
Trade-off: Higher power consumption. On AWS, this increases instance cost. See the design philosophy section for when this tradeoff makes sense.
Fix 5: Disable NIC Offloads {#offloads}
The Problem
Network interface cards have offload features that batch packets to reduce CPU load:
| Offload | What It Does | Latency Impact |
|---|---|---|
| GRO | Batches incoming packets | 5-50µs delay |
| LRO | Batches incoming packets (legacy) | 5-50µs delay |
| TSO | Batches outgoing packets | Minimal for small packets |
| GSO | Generic segmentation offload | Minimal for small packets |
The problem: Your exchange sends a market data packet. The NIC waits to see if more packets are coming so it can batch them together. Your packet sits in the NIC for 5-50µs waiting for friends that may never arrive.
The Kernel Internals
GRO is implemented in net/core/dev.c (kernel source). The NIC driver calls napi_gro_receive() which:
- Holds the packet in a GRO list
- Waits for more packets from the same flow
- Merges packets into a larger buffer
- Delivers to the stack when flushed
The flush triggers: Timer expiry OR softirq batch complete OR driver-specific thresholds.
The Fix
```bash
# Check current offloads
ethtool -k eth0 | grep -E 'offload|segmentation'

# Disable receive offloads
sudo ethtool -K eth0 gro off lro off

# Verify
ethtool -k eth0 | grep generic-receive-offload
```
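For auditing a whole fleet, the same check can be scripted by parsing `ethtool -k` output. A sketch (`check_offloads` is our helper name, not a standard tool):

```bash
# Parse `ethtool -k` output on stdin and flag the latency-relevant
# receive offloads that are still enabled.
check_offloads() {
  grep -E '^(generic-receive-offload|large-receive-offload):' |
    awk '$2 == "on" { print "WARN:", $1, "enabled" }'
}

ethtool -k eth0 2>/dev/null | check_offloads
```

Pair it with an interrupt-rate check after disabling offloads to confirm the hardware keeps up.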
Ansible automation:
```yaml
- name: Disable NIC offloads
  shell: |
    for iface in $(ls /sys/class/net | grep -v lo); do
      ethtool -K $iface gro off lro off 2>/dev/null || true
    done
  become: yes
```
Why It Works (Kernel Internals)
With GRO/LRO off:
- Each packet triggers an immediate softirq
- No batching delay
- Trade-off: Higher CPU usage from more interrupts
Verification:
```bash
# Check for increased interrupt rate (expected)
watch -n1 'grep eth0 /proc/interrupts'

# Check for no drops (confirm hardware keeps up)
ethtool -S eth0 | grep -E 'drop|error'
```
Expected Improvement
Eliminates 5-50µs packet batching delay.
Citation: GRO behavior documented in kernel networking documentation.
Connection: For IRQ affinity tuning, see Network Deep Dive.
Design Philosophy {#design-philosophy}
The Fundamental Tradeoff: Throughput vs Latency
Every optimization in this post trades throughput/power for latency:
| Optimization | What We Give Up | What We Get |
|---|---|---|
| swappiness=0 | Page cache efficiency | Predictable heap |
| THP disabled | Automatic huge pages | No compaction stalls |
| performance governor | Power savings | No frequency ramp |
| C-state limits | Power savings | Fast wake-up |
| Offloads disabled | CPU efficiency | Immediate packet delivery |
The Linux design principle: The kernel defaults make the common case fast. Web servers benefit from more page cache. Batch jobs benefit from frequency scaling. The kernel is right to optimize for throughput; that's what most workloads need.
Trading is different. We have:
- Known, bounded memory requirements
- Irregular, bursty workloads that fool frequency governors
- Hard latency SLOs that P99 spikes violate
When NOT to Apply These Changes
Not every system needs latency tuning:
- Batch processing: Throughput matters more; keep defaults
- Development environments: Don’t waste power
- Memory-constrained systems: swappiness=0 can trigger OOM
- Shared infrastructure: These settings affect all processes
The test: If your SLO is in seconds, defaults are fine. If your SLO is in milliseconds, audit your kernel.
For the philosophical framework, see First Principles of Trading Infrastructure.
Putting It All Together {#putting-it-together}
Quick Audit Commands
```bash
# Check all settings at once
echo "=== Swappiness ===" && sysctl vm.swappiness
echo "=== THP ===" && cat /sys/kernel/mm/transparent_hugepage/enabled
echo "=== Governor ===" && cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo "=== C-States ===" && cat /sys/devices/system/cpu/cpu0/cpuidle/state*/disable 2>/dev/null || echo "N/A"
echo "=== NIC Offloads ===" && ethtool -k eth0 2>/dev/null | grep -E 'generic-receive-offload|large-receive-offload'
```
Automated Audit: latency-audit
I built latency-audit to check all these settings:
```bash
pip install latency-audit && latency-audit
```
Checks 30+ settings across kernel, CPU, memory, network, and storage. Works on AWS, GCP, bare-metal.
Terraform for AWS Fleet
```hcl
resource "aws_launch_template" "trading" {
  name_prefix   = "trading-"
  instance_type = "c6in.xlarge"

  user_data = base64encode(<<-EOF
    #!/bin/bash
    # Kernel tuning
    echo 'vm.swappiness = 0' >> /etc/sysctl.conf
    sysctl -p

    # THP
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag

    # CPU governor
    for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > $cpu
    done

    # NIC offloads
    for iface in $(ls /sys/class/net | grep -v lo); do
      ethtool -K $iface gro off lro off 2>/dev/null || true
    done
  EOF
  )
}
```
The kernel is optimized for throughput. Trading requires latency. Know the difference, and tune accordingly.
Up Next in Linux Infrastructure Deep Dives
PTP or Die: Hardware Timestamping for Regulatory-Grade Time Sync
Why NTP is fundamentally broken for HFT compliance, and how we implemented IEEE 1588 PTPv2 with hardware timestamping to achieve sub-100ns accuracy on Solarflare NICs.
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| I/O schedulers, Direct I/O, EBS tuning | I/O Schedulers: Why the Kernel Reorders Your Writes |
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |