The $2M Millisecond: Linux Defaults That Cost You Money
Deep dive into vm.swappiness, THP compaction, and C-states. Kernel internals, measurements, and the design philosophy behind low-latency Linux tuning.
I’ve audited trading servers at HFT firms and crypto exchanges for over a decade. 80% fail the same 5 kernel configuration checks.
The default Linux kernel is optimized for throughput (maximizing total work completed), not latency (minimizing response time). This is intentional. The kernel developers designed for the common case: web servers, databases, and batch processing where throughput matters more than individual request latency.
Trading is not the common case. A 100µs delay on a $1M order costs real money. This post goes deep into the kernel internals that cause these delays, why they exist, and how to fix them.
The Problem {#the-problem}
Every Linux distribution ships with defaults designed for general-purpose workloads:
| Setting | Default Value | Latency Impact | Root Cause |
|---|---|---|---|
| vm.swappiness | 60 | 10-100µs per page fault | Anonymous page reclamation |
| THP | always | 10-50ms compaction stalls | khugepaged defragmentation |
| CPU governor | powersave | 10-50µs frequency ramp | DVFS transitions |
| C-states | enabled | 50-100µs wake latency | Voltage/clock restoration |
| NIC offloads | enabled | 5-50µs packet batching | GRO/LRO coalescing |
Why do these defaults exist? They save power and maximize throughput. The kernel developers made reasonable tradeoffs for 99% of workloads. Trading systems are in the other 1%.
For deep dives into each subsystem, see:
- CPU Optimization - Governors, C-states, NUMA
- Memory Tuning - THP, swappiness, huge pages
- Network Optimization - Offloads, IRQ affinity
- Storage I/O - Schedulers, Direct I/O
Background: How Linux Memory Management Works {#background}
Before diving into fixes, you need to understand how Linux manages memory. This context explains why the defaults hurt latency.
The Page Cache and Anonymous Pages
Linux divides memory into two categories:
- File-backed pages (page cache): Cached file contents. Can be evicted by dropping them (clean) or writing back (dirty).
- Anonymous pages: Heap, stack, and mmap’d memory without a file backing. Can only be evicted by swapping to disk.
The kernel maintains LRU (Least Recently Used) lists for both types. When memory pressure occurs, the kswapd daemon scans these lists looking for pages to reclaim (kernel source: mm/vmscan.c).
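You can see the current split between the two LRU populations directly in /proc/meminfo. A minimal sketch (`lru_split` is our helper name, not a standard tool):

```bash
# Show the current anonymous vs file-backed page split from
# /proc/meminfo; lru_split is our helper name.
lru_split() {
  awk '/^(Active|Inactive)\(anon\)/ { anon += $2 }
       /^(Active|Inactive)\(file\)/ { file += $2 }
       END { printf "anon: %d kB, file: %d kB\n", anon, file }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
  lru_split
fi
```

On a box with a large file-backed population, swappiness decides which side of this split gets reclaimed first.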
The Memory Pressure Response
When free memory drops below a threshold, the kernel:
- Background reclamation (kswapd): A kernel thread wakes and starts scanning LRU lists
- Direct reclamation: If kswapd can’t keep up, the allocating process itself must wait while memory is freed
- OOM killer: Last resort; kill processes to free memory
Direct reclamation is the latency killer. Your trading thread asks for memory, and instead of getting it immediately, it waits while the kernel frees memory. This can take 10-100µs for page cache eviction or 1-10ms for swap operations.
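Direct-reclaim events are visible in /proc/vmstat. A rough sketch for spotting them (counter names assume a 4.10+ kernel, where `allocstall` is split per zone; `count_allocstalls` is our helper name):

```bash
# Count direct-reclaim stalls over an interval. allocstall_* in
# /proc/vmstat increments each time an allocating process has to
# reclaim memory itself instead of getting it from the free lists.
count_allocstalls() {
  awk '/^allocstall/ { total += $2 } END { print total + 0 }' "${1:-/proc/vmstat}"
}

if [ -r /proc/vmstat ]; then
  before=$(count_allocstalls)
  sleep 1
  after=$(count_allocstalls)
  echo "direct-reclaim events in 1s: $((after - before))"
fi
```

Run it before and after a load test; any nonzero delta means an allocating thread paid the reclaim bill itself.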
For more on memory internals, see Memory Tuning Deep Dive.
Fix 1: Disable Aggressive Swapping {#swappiness}
The Problem
With vm.swappiness = 60 (default), the kernel treats file-backed and anonymous pages roughly equally when deciding what to evict. This means your application heap can be swapped to disk even when there’s plenty of page cache that could be evicted instead.
The kernel code: In mm/vmscan.c, the get_scan_count() function calculates how many pages to scan from each LRU list. The swappiness value directly influences this ratio (kernel source).
```c
// Simplified from mm/vmscan.c
anon_prio = swappiness;
file_prio = 200 - swappiness;
```
With swappiness=60, anonymous pages get scan priority 60 against 140 for file pages, so the kernel scans them at less than half the rate of file pages. With swappiness=0, anonymous pages are only scanned under extreme memory pressure.
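The split is easy to sanity-check. A toy calculator of the simplified priority formula above (the real `get_scan_count()` also scales these by recent reclaim cost and memory pressure):

```bash
# Toy calculator for the simplified anon/file priority split;
# scan_ratio is our helper name, not a kernel interface.
scan_ratio() {
  anon_prio=$1
  file_prio=$((200 - $1))
  echo "swappiness=$1: anon_prio=$anon_prio file_prio=$file_prio"
}

scan_ratio 60   # swappiness=60: anon_prio=60 file_prio=140
scan_ratio 0    # swappiness=0: anon_prio=0 file_prio=200
```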
The Fix
```bash
# Check current value
cat /proc/sys/vm/swappiness
# Output: 60 (default)

# Set to 0 for latency-critical systems
echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Verify
sysctl vm.swappiness
```
Ansible automation:
```yaml
- name: Set swappiness to 0
  sysctl:
    name: vm.swappiness
    value: '0'
    state: present
    reload: yes
```
Why It Works (Kernel Internals)
With swappiness=0:
- Anonymous pages stay in RAM unless the system is critically low on memory
- Page cache is evicted first (this is safe-file contents can be re-read)
- Your trading heap never gets swapped unless the alternative is OOM
Verification:
```bash
# Watch for swap activity (should be 0)
vmstat 1 | awk '{print $7, $8}'  # si (swap in), so (swap out)

# Check current swap usage
free -h
```
Expected Improvement
- Eliminates 10-100µs page fault stalls from swap reads (measured on NVMe)
- EBS/network storage: eliminates 1-5ms stalls
Citation: Page fault latency measured using eBPF tracing. See Brendan Gregg’s Memory Flame Graphs for methodology.
Fix 2: Disable Transparent Huge Pages {#thp}
The Problem
Transparent Huge Pages (THP) automatically promotes 4KB pages to 2MB pages, reducing TLB misses. Sounds good. The problem is how it does this.
The khugepaged kernel thread continuously scans memory looking for contiguous 4KB pages it can merge into 2MB pages. This requires:
- Memory compaction: Moving pages around to create contiguous regions
- Process stalling: Holding mmap_sem while promoting pages
The killer: Compaction can stall your process for 10-50 milliseconds. Not microseconds; milliseconds. During a THP compaction event, your trading thread is frozen.
The Kernel Internals
THP is managed by the khugepaged kernel thread (kernel source: mm/khugepaged.c). When enabled, it:
- Scans process address spaces every `khugepaged_scan_sleep_millisecs` (default: 10000ms)
- Attempts to collapse contiguous pages into huge pages
- May trigger memory compaction if huge pages aren’t available
The compaction problem: Memory compaction (mm/compaction.c) migrates pages between zones to create contiguous regions. This holds locks that can block memory allocation.
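A quick way to see whether compaction and THP activity are happening on a box is to snapshot the relevant /proc/vmstat counters. A sketch (`thp_counters` is our helper name):

```bash
# Snapshot the /proc/vmstat counters that reveal compaction and THP
# activity: compact_stall counts direct-compaction events, and
# thp_fault_fallback counts THP allocations that fell back to 4KB.
thp_counters() {
  awk '$1 ~ /^(compact_stall|thp_fault_alloc|thp_fault_fallback|thp_collapse_alloc)$/ { print $1, $2 }' "${1:-/proc/vmstat}"
}

if [ -r /proc/vmstat ]; then
  thp_counters
fi
```

A rising `compact_stall` on a latency-critical host is the smoking gun this section describes.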
The Fix
```bash
# Check current status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: [always] madvise never

# Disable THP entirely
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# Verify
grep -E 'AnonHugePages|HugePages_' /proc/meminfo
```
Make persistent (EC2 user_data):
```bash
#!/bin/bash
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```
For Kubernetes (DaemonSet):
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disable-thp
spec:
  selector:
    matchLabels:
      name: disable-thp
  template:
    metadata:
      labels:
        name: disable-thp
    spec:
      hostPID: true
      containers:
        - name: disable-thp
          image: busybox
          command: ['sh', '-c', 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && sleep infinity']
          securityContext:
            privileged: true
          volumeMounts:
            - name: sys
              mountPath: /sys
      volumes:
        - name: sys
          hostPath:
            path: /sys
```
Why It Works (Kernel Internals)
Disabling THP:
- Stops `khugepaged` from scanning your address space
- Prevents compaction from holding mmap_sem
- Eliminates the 10-50ms stall risk
Verification:
```bash
# Watch for compaction activity
watch -n1 'grep -E "compact_|thp_" /proc/vmstat'

# Profile with perf during suspected stalls
sudo perf record -g -a sleep 10
sudo perf report
```
Expected Improvement
Eliminates 10-50ms compaction stalls. This is often the single biggest P99 improvement for trading systems.
Citation: THP compaction delays measured in Facebook’s THP Study and documented in kernel documentation.
Trade-off: You lose automatic huge page benefits. For controlled huge page usage, see explicit huge pages in Memory Tuning.
Fix 3: Lock CPU Frequency {#governors}
The Problem
Modern CPUs use Dynamic Voltage and Frequency Scaling (DVFS) to save power. The CPU governor decides when to change frequency based on load.
Governors explained:
| Governor | Behavior | Latency Impact |
|---|---|---|
| powersave | Minimum frequency always | Maximum latency on first instruction |
| ondemand | Ramps up when busy | 10-50µs ramp time |
| performance | Maximum frequency always | No ramp latency |
The ondemand governor (common default) monitors CPU utilization and ramps frequency. The problem: the first instructions after idle run at low frequency.
The Kernel Internals
CPU frequency scaling is managed by the cpufreq subsystem (kernel source: drivers/cpufreq/). When using the ondemand governor:
- A timer fires every `sampling_rate` microseconds (default: 10000)
- The governor checks CPU utilization
- If above threshold, frequency increases
- If below, frequency decreases
The latency: Frequency changes require voltage changes. The hardware needs 10-50µs to stabilize at the new frequency.
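One consequence: any core not pinned at its maximum frequency is a latency liability. A small audit sketch over the standard cpufreq sysfs files (`check_freq` is our helper name; values are in kHz):

```bash
# Flag any core not running at its configured maximum frequency,
# using the standard cpufreq sysfs files (values in kHz).
check_freq() {
  cur=$(cat "$1/scaling_cur_freq")
  max=$(cat "$1/scaling_max_freq")
  if [ "$cur" -lt "$max" ]; then
    echo "below max: $cur < $max kHz"
  else
    echo "at max: $cur kHz"
  fi
}

for d in /sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
  [ -d "$d" ] || continue
  echo "$d: $(check_freq "$d")"
done
```

Note that `scaling_cur_freq` is itself sampled, so a core transitioning at the moment you read it can report either value.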
The Fix
```bash
# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Output: ondemand (or powersave)

# Set all cores to performance
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance | sudo tee "$cpu"
done

# Verify frequency is at maximum
watch -n1 'grep MHz /proc/cpuinfo | head -4'
```
Ansible automation:
```yaml
- name: Set CPU governor to performance
  shell: |
    for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > $cpu
    done
  become: yes
```
Why It Works (Kernel Internals)
The performance governor bypasses the sampling mechanism entirely. It pins the frequency at scaling_max_freq and leaves it there.
Verification:
```bash
# Confirm frequency is stable at max
turbostat --interval 1 --show Core,CPU,Bzy_MHz
```
Expected Improvement
Eliminates 10-50µs frequency ramp latency on first instructions after idle.
Citation: DVFS transition latencies documented in Intel Software Developer’s Manual, Vol. 3, Chapter 14.
Connection: For C-states (idle states), see the CPU Deep Dive. For NUMA effects on frequency, see CPU NUMA section.
Fix 4: Disable Deep C-States {#cstates}
The Problem
Even with the performance governor, idle CPUs enter C-states to save power:
| C-State | What Happens | Wake Latency |
|---|---|---|
| C0 | Active | 0 |
| C1 | Clock stopped | 1-5µs |
| C1E | Clock + voltage reduced | 5-10µs |
| C3 | L1/L2 cache cold | 30-50µs |
| C6 | Voltage cut, state saved to RAM | 50-100µs |
The problem: Your trading thread is idle for 1ms waiting for market data. The CPU enters C6. Market data arrives. The CPU takes 50-100µs to wake up.
The Kernel Internals
C-state management is handled by the intel_idle driver (kernel source: drivers/idle/intel_idle.c). When a CPU has no work:
- The scheduler calls `do_idle()`
- `do_idle()` selects a C-state based on expected idle time
- The CPU enters the selected state
- On interrupt, the CPU wakes and resumes
The selection algorithm: The cpuidle governor (menu or ladder) predicts idle time and picks the deepest state that can wake within the expected time. For unpredictable trading workloads, this prediction is often wrong.
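The exit latency the kernel believes for each state is exported in sysfs, which is useful for sanity-checking the wake-latency table above on your own hardware. A sketch (`list_cstates` is our helper name; latency is in microseconds):

```bash
# Print each C-state's name and advertised exit latency from the
# standard cpuidle sysfs layout (/sys/.../cpuidle/stateN/{name,latency}).
list_cstates() {
  for s in "${1:-/sys/devices/system/cpu/cpu0/cpuidle}"/state*; do
    [ -d "$s" ] || continue
    printf '%s: exit latency %sus\n' "$(cat "$s/name")" "$(cat "$s/latency")"
  done
}

list_cstates
```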
The Fix
Option 1: Kernel boot parameters (recommended)
```bash
# Add to GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate
# the grub config (e.g. update-grub) and reboot
processor.max_cstate=1 intel_idle.max_cstate=0
```
This limits to C1 only-clock stops but voltage stays on.
Option 2: Runtime (temporary)
```bash
# Disable each C-state beyond C1 (state0 = POLL, state1 = C1;
# exact numbering varies by CPU, so check state*/name first)
for state in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]/disable; do
  echo 1 | sudo tee "$state"
done
```
Why It Works (Kernel Internals)
With max_cstate=1:
- CPUs enter C1 when idle (1-5µs wake)
- Never enter C3/C6 (50-100µs wake)
- Power consumption increases, but latency is predictable
Verification:
```bash
# Check current C-state residency (turbostat column names: CPU%c1 etc.)
turbostat --interval 1 --show Core,CPU%c1,CPU%c3,CPU%c6
# C3 and C6 residency should be 0%
```
Expected Improvement
Reduces worst-case wake latency from 50-100µs to 1-5µs.
Citation: C-state latencies from Intel Power Management Reference.
Trade-off: Higher power consumption. On AWS, this increases instance cost. See the design philosophy section for when this tradeoff makes sense.
Fix 5: Disable NIC Offloads {#offloads}
The Problem
Network interface cards have offload features that batch packets to reduce CPU load:
| Offload | What It Does | Latency Impact |
|---|---|---|
| GRO | Batches incoming packets | 5-50µs delay |
| LRO | Batches incoming packets (legacy) | 5-50µs delay |
| TSO | Batches outgoing packets | Minimal for small packets |
| GSO | Generic segmentation offload | Minimal for small packets |
The problem: Your exchange sends a market data packet. The NIC waits to see if more packets are coming so it can batch them together. Your packet sits in the NIC for 5-50µs waiting for friends that may never arrive.
The Kernel Internals
GRO is implemented in net/core/dev.c (kernel source). The NIC driver calls napi_gro_receive() which:
- Holds the packet in a GRO list
- Waits for more packets from the same flow
- Merges packets into a larger buffer
- Delivers to the stack when flushed
The flush triggers: Timer expiry OR softirq batch complete OR driver-specific thresholds.
The Fix
```bash
# Check current offloads
ethtool -k eth0 | grep -E 'offload|segmentation'

# Disable receive offloads
sudo ethtool -K eth0 gro off lro off

# Verify
ethtool -k eth0 | grep generic-receive-offload
```
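For auditing a whole fleet, the same check can be scripted by parsing `ethtool -k` output. A sketch (`check_offloads` is our helper name, not a standard tool):

```bash
# Parse `ethtool -k` output on stdin and flag the latency-relevant
# receive offloads that are still enabled.
check_offloads() {
  grep -E '^(generic-receive-offload|large-receive-offload):' |
    awk '$2 == "on" { print "WARN:", $1, "enabled" }'
}

ethtool -k eth0 2>/dev/null | check_offloads
```

Pair it with an interrupt-rate check after disabling offloads to confirm the hardware keeps up.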
Ansible automation:
```yaml
- name: Disable NIC offloads
  shell: |
    for iface in $(ls /sys/class/net | grep -v lo); do
      ethtool -K $iface gro off lro off 2>/dev/null || true
    done
  become: yes
```
Why It Works (Kernel Internals)
With GRO/LRO off:
- Each packet triggers an immediate softirq
- No batching delay
- Trade-off: Higher CPU usage from more interrupts
Verification:
```bash
# Check for increased interrupt rate (expected)
watch -n1 'grep eth0 /proc/interrupts'

# Check for no drops (confirm hardware keeps up)
ethtool -S eth0 | grep -E 'drop|error'
```
Expected Improvement
Eliminates 5-50µs packet batching delay.
Citation: GRO behavior documented in kernel networking documentation.
Connection: For IRQ affinity tuning, see Network Deep Dive.
Design Philosophy {#design-philosophy}
The Fundamental Tradeoff: Throughput vs Latency
Every optimization in this post trades throughput/power for latency:
| Optimization | What We Give Up | What We Get |
|---|---|---|
| swappiness=0 | Page cache efficiency | Predictable heap |
| THP disabled | Automatic huge pages | No compaction stalls |
| performance governor | Power savings | No frequency ramp |
| C-state limits | Power savings | Fast wake-up |
| Offloads disabled | CPU efficiency | Immediate packet delivery |
The Linux design principle: The kernel defaults make the common case fast. Web servers benefit from more page cache. Batch jobs benefit from frequency scaling. The kernel is right to optimize for throughput; that's what most workloads need.
Trading is different. We have:
- Known, bounded memory requirements
- Irregular, bursty workloads that fool frequency governors
- Hard latency SLOs that P99 spikes violate
When NOT to Apply These Changes
Not every system needs latency tuning:
- Batch processing: Throughput matters more; keep defaults
- Development environments: Don’t waste power
- Memory-constrained systems: swappiness=0 can trigger OOM
- Shared infrastructure: These settings affect all processes
The test: If your SLO is in seconds, defaults are fine. If your SLO is in milliseconds, audit your kernel.
For the philosophical framework, see First Principles of Trading Infrastructure.
Putting It All Together {#putting-it-together}
Quick Audit Commands
```bash
# Check all settings at once
echo "=== Swappiness ===" && sysctl vm.swappiness
echo "=== THP ===" && cat /sys/kernel/mm/transparent_hugepage/enabled
echo "=== Governor ===" && cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo "=== C-States ===" && cat /sys/devices/system/cpu/cpu0/cpuidle/state*/disable 2>/dev/null || echo "N/A"
echo "=== NIC Offloads ===" && ethtool -k eth0 2>/dev/null | grep -E 'generic-receive-offload|large-receive-offload'
```
Automated Audit: latency-audit
I built latency-audit to check all these settings:
```bash
pip install latency-audit && latency-audit
```
Checks 30+ settings across kernel, CPU, memory, network, and storage. Works on AWS, GCP, bare-metal.
Terraform for AWS Fleet
```hcl
resource "aws_launch_template" "trading" {
  name_prefix   = "trading-"
  instance_type = "c6in.xlarge"

  user_data = base64encode(<<-EOF
    #!/bin/bash
    # Kernel tuning
    echo 'vm.swappiness = 0' >> /etc/sysctl.conf
    sysctl -p

    # THP
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag

    # CPU governor
    for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > $cpu
    done

    # NIC offloads
    for iface in $(ls /sys/class/net | grep -v lo); do
      ethtool -K $iface gro off lro off 2>/dev/null || true
    done
  EOF
  )
}
```
The kernel is optimized for throughput. Trading requires latency. Know the difference, and tune accordingly.
Up Next in Linux Infrastructure Deep Dives
PTP or Die: Hardware Timestamping for Regulatory-Grade Time Sync
Why NTP is fundamentally broken for HFT compliance, and how we implemented IEEE 1588 PTPv2 with hardware timestamping to achieve sub-100ns accuracy on Solarflare NICs.
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| I/O schedulers, Direct I/O, EBS tuning | I/O Schedulers: Why the Kernel Reorders Your Writes |
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |