The Physics of Memory: TLB, HugePages & page_fault_latency

Virtual Memory is an illusion. The physics of the Translation Lookaside Buffer (TLB), why 4KB pages are too small for HFT, and the cost of a Page Table Walk.


🎯 What You'll Learn

  • Deconstruct Virtual Memory (VA -> PA Translation)
  • Measure the latency of a TLB Miss (Page Table Walk Physics)
  • Implement 1GB HugePages (`hugepagesz=1G`)
  • Audit `thp_collapse_scan` (The 100ms Latency Spike)
  • Tune NUMA Locality with `numactl`

Introduction

In the Kernel, memory does not exist as a contiguous block. It is fragmented. The CPU maintains an illusion called Virtual Memory: it breaks RAM into 4KB Pages. For a 32GB working set, that is roughly 8 million entries in the lookup structure (the Page Table).

Every single memory access must consult this table. If the mapping is not in the CPU cache (TLB), the CPU must walk the table in slower RAM. This lesson explores TLB Physics and why massive pages (1GB) are critical for speed.


The Physics: TLB Reach

The TLB (Translation Lookaside Buffer) is a tiny cache inside the CPU Core that stores address mappings.

  • L1 TLB Size: ~64 entries.
  • Access Time: 1-2 ns.

The Math of Reach:

  • With 4KB Pages: 64 × 4KB = 256KB Reach.
  • With 1GB HugePages: 64 × 1GB = 64GB Reach.

Physics Outcome: If your trading engine uses 4GB of RAM with 4KB pages, the TLB covers about 0.006% of your memory. You will suffer TLB Thrashing: most accesses miss, forcing a Page Table Walk (up to four dependent memory reads on x86-64, hundreds of cycles).


The Silent Killer: Transparent Huge Pages (THP)

The Kernel tries to help. It has a background thread (khugepaged) that merges 4KB pages into 2MB pages automatically. This is deadly for latency.

  1. Stop-the-World: The kernel locks the memory region to merge pages.
  2. Spike: Your application freezes for 10ms - 200ms during compaction.
  3. Randomness: It happens unpredictably.

Rule #1 of HFT: Disable THP.

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

Code: Allocating 1GB HugePages

We typically don’t want 2MB pages. We want 1GB Pages, so the entire heap fits in a handful of TLB entries.

1. Boot Parameters

Edit /etc/default/grub:

# Reserve 4 x 1GB Pages. Total 4GB.
GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=4"

Note: This MUST be done at boot to ensure the RAM is physically contiguous.

2. Consuming HugePages (C/C++)

Standard malloc() won’t use these pages. You need mmap() with the right flags.

#include <stdio.h>     // perror
#include <stdlib.h>    // exit
#include <sys/mman.h>  // mmap, mlock

void* get_huge_memory() {
    // Request 1GB
    size_t size = 1024 * 1024 * 1024; 
    
    // MAP_HUGETLB is the key. With default_hugepagesz=1G this uses 1GB
    // pages; MAP_HUGE_1GB (Linux 3.8+) selects the size explicitly.
    void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                     
    if (ptr == MAP_FAILED) {
        perror("mmap failed. Did you reserve HugePages?");
        exit(1);
    }
    
    // Lock it to prevent swap
    if (mlock(ptr, size) == -1) {
        perror("mlock failed");
    }
    
    return ptr;
}

NUMA: The Speed of Light (Again)

On dual-socket servers, half the RAM is attached to CPU 0 and half to CPU 1. Accessing “Remote Node” RAM travels over the QPI/UPI interconnect. Physics Penalty: roughly +40ns latency (approx. 30% slower).

Solution: numactl

# Run on CPU Node 0, Use ONLY RAM from Node 0
numactl --cpunodebind=0 --membind=0 ./my_trading_app

Practice Exercises

Exercise 1: TLB Miss Hunt (Beginner)

  • Task: perf stat -e dTLB-load-misses ./app
  • Action: Run with standard malloc vs HugePages.
  • Observation: Watch misses drop from millions to near zero.

Exercise 2: The THP Scan (Intermediate)

  • Task: Enable THP (always).
  • Action: Run a fragmentation script. Watch grep thp_collapse_scan /proc/vmstat.
  • Goal: See the counters incrementing (kernel interference).

Exercise 3: NUMA Benchmarking (Advanced)

  • Task: numactl --membind=1 ./app (force remote memory) while running on CPU node 0.
  • Action: Measure bandwidth.
  • Compare: With --membind=0 (local).


Knowledge Check

  1. What is the TLB Reach of 4KB pages vs 1GB pages?
  2. Why is khugepaged bad for latency?
  3. Can you allocate 1GB pages while the system is running?
  4. What syscall locks memory to prevent swapping?
  5. What is the latency penalty of Remote NUMA access?
Answers
  1. 256KB vs 64GB. (Assuming 64-entry L1 TLB).
  2. Compaction Stalls. It pauses memory access to merge pages.
  3. Rarely. Memory is usually too fragmented. Do it at boot.
  4. mlock().
  5. ~40ns. (Roughly 160 cycles).

Summary

  • TLB: The most important cache for large working sets.
  • HugePages: The most effective way to increase TLB Reach.
  • THP: Good for web servers, bad for trading.
  • NUMA: Local RAM is fast RAM.


Pro Version: For production-grade implementation details, see the full research article: memory-tuning-linux-latency


