The Physics of CPU Latency: Caches, Context Switches & Isolation

Why your code is slow. The physics of CPU Caches (L1/L2/L3), the 4µs cost of a Context Switch, and the `isolcpus` kernel boot parameter.


🎯 What You'll Learn

  • Deconstruct the CPU Memory Hierarchy (L1 vs RAM)
  • Measure the exact cost of a Context Switch (Syscall Physics)
  • Configure Kernel Isolation (`isolcpus`, `nohz_full`)
  • Pin processes to specific cores using `taskset`
  • Analyze False Sharing (Cache Coherency Physics)

Introduction

In High-Frequency Trading (HFT), we don’t think in milliseconds. We think in Clock Cycles. A 4GHz CPU executes 4 Billion cycles per second. 1 Cycle = 0.25 nanoseconds.

When your code waits for RAM, it wastes ~400 cycles. When the OS switches tasks, it wastes on the order of 12,000 cycles. This lesson explores the physics of the CPU: how to keep data hot in L1 cache, and how to banish the Kernel Scheduler from your trading cores.


The Speed of Light: Cache Physics

Data does not move instantly. It travels through silicon.

Storage     Latency (ns)   Cycles (4 GHz)   Physics Metaphor
L1 Cache    1              4                Picking a pen from your desk.
L2 Cache    4              16               Picking a book from the shelf.
L3 Cache    12             48               Walking to the next room.
RAM         100            400              Walking to the warehouse.

The Goal: Stick to L1. If you access a random memory address (Linked List), you hit RAM. If you access contiguous memory (Array), the CPU Prefetcher pulls it into L1.
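The prefetcher effect above can be sketched in Python. This is an illustrative benchmark, not a precise measurement: CPython's interpreter overhead and pointer-boxed objects blur the picture (a real measurement belongs in C with `perf`), but traversing nodes in shuffled allocation order still defeats the hardware prefetcher enough to show up. All names (`N`, `nxt`, `order`) are ours.

```python
import random
import time

N = 1_000_000

# Contiguous, predictable access: iterate a list in order.
data = list(range(N))
t0 = time.perf_counter()
total_seq = sum(data)
seq_time = time.perf_counter() - t0

# Pointer chasing: a "linked list" encoded as next-indices that are
# visited in shuffled order, so every hop lands at a random address
# and the prefetcher cannot help.
order = list(range(N))
random.shuffle(order)
nxt = [0] * N
for i in range(N - 1):
    nxt[order[i]] = order[i + 1]

t0 = time.perf_counter()
total_chase = 0
node = order[0]
for _ in range(N - 1):
    total_chase += node
    node = nxt[node]
chase_time = time.perf_counter() - t0

print(f"sequential: {seq_time*1000:.1f} ms, pointer-chasing: {chase_time*1000:.1f} ms")
```

On typical hardware the shuffled traversal runs several times slower, even through the interpreter.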


Context Switches: The Invisible Tax

A Context Switch is when the CPU stops your code to run something else (another app, or the Kernel). It is catastrophic for latency.

The Physics:

  1. Save Registers: The CPU must store your program’s register state before loading the new task’s.
  2. Pollute L1 Cache: The new process evicts your hot data from L1.
  3. TLB Flush: The Translation Lookaside Buffer (Virtual Memory map) is wiped.

Cost: ~2–4 microseconds (8,000–16,000 cycles at 4 GHz). Solution: CPU Pinning.
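You can measure this tax yourself with a classic pipe ping-pong: two processes that each block reading a pipe until the other writes, forcing at least two context switches per round trip. A minimal sketch (Linux/macOS only, since it uses `os.fork`; the measured time also includes syscall and pipe-wakeup overhead, so treat it as an upper bound):

```python
import os
import time

ROUNDS = 10_000

# Two pipes: parent -> child and child -> parent. Each one-byte
# round trip forces the CPU to switch between the two processes,
# because each side blocks in read() until the other writes.
p2c_r, p2c_w = os.pipe()
c2p_r, c2p_w = os.pipe()

pid = os.fork()
if pid == 0:  # child: echo every byte straight back
    for _ in range(ROUNDS):
        os.read(p2c_r, 1)
        os.write(c2p_w, b"x")
    os._exit(0)

t0 = time.perf_counter()
for _ in range(ROUNDS):
    os.write(p2c_w, b"x")
    os.read(c2p_r, 1)
elapsed = time.perf_counter() - t0
os.waitpid(pid, 0)

# One round trip ~= 2 context switches plus syscall overhead.
print(f"round trip: {elapsed / ROUNDS * 1e6:.2f} µs")
```

Pin both processes to the same core with `taskset` and the number drops, because the caches stay warmer between switches.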


Code: CPU Pinning & Isolation

We tell the Linux Scheduler: “Do not touch CPUs 2 and 3.”

1. Boot Parameters (The Nuclear Option)

Edit /etc/default/grub:

# isolcpus: Remove from scheduler balancing
# nohz_full: Stop scheduling-clock ticks (1000Hz -> 1Hz)
# rcu_nocbs: Move RCU callbacks to housekeeping cores
GRUB_CMDLINE_LINUX="isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3"

Note: Run update-grub and reboot.

2. Runtime Pinning (taskset)

Now CPUs 2 and 3 sit empty. You must manually force your app onto them.

# Launch a python script on CPU 2
taskset -c 2 python3 my_trading_algo.py

# Check affinity (Physics Verification)
pid=$(pgrep -f my_trading_algo)
taskset -p $pid
# output: pid 1234's current affinity mask: 4 (Binary 100 for CPU 2)
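You can also pin from inside the process itself, without wrapping the launch in `taskset`. A minimal sketch using Python’s Linux-only `os.sched_setaffinity` (CPU 2 is this lesson’s example core; the code falls back to the lowest allowed CPU so it runs on any machine):

```python
import os

# Prefer CPU 2 (this lesson's isolated core), but fall back to a CPU
# that actually exists in this process's allowed set.
allowed = os.sched_getaffinity(0)          # 0 = the calling process
target = 2 if 2 in allowed else min(allowed)

# Equivalent of: taskset -c <target> -p $$
os.sched_setaffinity(0, {target})
print("pinned to CPUs:", os.sched_getaffinity(0))
```

Doing it in-process is useful when one program wants to place different threads or workers on different cores.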

False Sharing: The Concurrency Killer

Imagine two threads on different Cores writing to variables that sit next to each other in RAM. CPUs cache data in 64-byte Cache Lines.

  • Thread A writes to Variable X.
  • Thread B writes to Variable Y.
  • If X and Y are in the same 64-byte line, Core A and Core B fight.
  • Physics: The Cache Coherency Protocol (MESI) forces constant L1 invalidations. Code can slow down by up to 50x.

Fix: Pad your data structures to ensure separation.
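In C or C++ you would use `alignas(64)`; the same padding idea can be sketched in Python with `ctypes`, which lets you see the memory layout explicitly. The struct names and the 64-byte figure (typical for x86-64) are our assumptions:

```python
import ctypes

CACHE_LINE = 64  # bytes; typical cache-line size on x86-64

class PaddedCounter(ctypes.Structure):
    # Each counter occupies a full 64-byte cache line, so two cores
    # incrementing adjacent counters never invalidate each other.
    _fields_ = [
        ("value", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
    ]

class PackedCounters(ctypes.Structure):
    # Naive layout: both counters share one cache line -> false sharing.
    _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]

print(ctypes.sizeof(PaddedCounter), ctypes.sizeof(PackedCounters))  # prints: 64 16
```

Two `PaddedCounter`s land on two separate lines (128 bytes total); the packed pair squeezes into 16 bytes of a single line, which is exactly the layout that triggers the MESI ping-pong described above.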


Practice Exercises

Exercise 1: The Context Switch Cost (Beginner)

Task: Use perf to measure context switches. Action: perf stat -e context-switches ./my_script. Goal: Drive this number toward zero.

Exercise 2: Cache Miss Profiling (Intermediate)

Task: Run perf stat -e L1-dcache-load-misses ./my_script. Action: Change a Linked List to an Array. Watch misses drop.

Exercise 3: Full Isolation (Advanced)

Task: Isolate CPU 3 via GRUB. Action: Run a busy-loop pinned to CPU 3. Observation: In htop, CPU 3 sits at 100% usage, yet the rest of the system stays responsive, because the scheduler no longer places regular tasks on the isolated core.


Knowledge Check

  1. How many cycles does a RAM access cost?
  2. What is a Cache Line size?
  3. What does isolcpus do?
  4. Why is a Linked List slower than an Array?
  5. What is False Sharing?
Answers
  1. ~400 cycles.
  2. 64 Bytes.
  3. Removes a CPU from the kernel scheduler’s balancing algorithms.
  4. Pointer chasing. Arrays are contiguous and prefetch-friendly; Linked Lists are random memory jumps (RAM hits).
  5. Two cores fighting over the same Cache Line due to proximity of variables.

Summary

  • L1 Cache: The only fast storage.
  • Context Switch: A penalty on the order of 12,000 cycles.
  • isolcpus: Evicting the Scheduler.
  • False Sharing: The invisible concurrency bug.


Pro Version: For production-grade implementation details, see the full research article: cpu-optimization-linux-latency
