The Physics of Matching Engines: Speed & Determinism

Why the fastest engines in the world are Single-Threaded. Ring Buffers, CPU Cache alignment, and the LMAX Disruptor.

🎯 What You'll Learn

  • Deconstruct the 'Single-Threaded Determinism' rule
  • Analyze the LMAX Disruptor Pattern (Ring Buffers)
  • Calculate Cache Miss penalties (`std::map` vs Arrays)
  • Trace a packet from NIC to Matching Logic (Kernel Bypass)
  • Architect a Lock-Free Order Processing Pipeline

Introduction

In Web Development, we scale by adding threads. In High Frequency Trading, adding a thread to the hot path is a death sentence.

Why? Locks. A context switch costs roughly 2 microseconds. A contended lock costs roughly 5 microseconds. A modern Matching Engine matches an order in about 50 nanoseconds. Take a single lock on that path and you are ~100x slower than the match itself.

This lesson explores the physics of building the fastest software on Earth.


The Physics: Single-Threaded Determinism

Markets must be Fair. If Client A sends an order at t=1 and Client B sends at t=2, A must execute before B. With multi-threaded matching, the thread handling B's order can win a race against the thread handling A's and execute first. For a regulated venue, that violation of time priority is illegal.

The Solution: The Sequencer. All incoming packets are serialized into a single, totally ordered stream, and a single thread processes that stream. No locks are needed because only one thread ever owns the matching state.
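A minimal sketch of the idea in Python (the names Sequencer and on_packet are illustrative, not from any real engine): every inbound order is stamped with a monotonically increasing sequence number at arrival, and the single matching thread consumes the stream strictly in stamp order.

import itertools

class Sequencer:
    """Serializes all inbound packets into one totally ordered stream."""

    def __init__(self):
        self._seq = itertools.count(1)  # monotonically increasing stamps
        self._stream = []               # stands in for the ring buffer below

    def on_packet(self, order):
        # Arrival order, not thread scheduling, decides priority.
        order["seq"] = next(self._seq)
        self._stream.append(order)

    def drain(self):
        # Consumed by exactly ONE matching thread, in stamp order:
        # Client A (t=1) can never be overtaken by Client B (t=2).
        while self._stream:
            yield self._stream.pop(0)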


Deep Dive: The Ring Buffer (LMAX Disruptor)

How do you pass data between the Network Thread (Writer) and the Matching Thread (Reader) without a lock? The Ring Buffer.

Imagine a circular array of 1 million slots.

  • Writer: “I wrote to slot 100. I am moving my cursor to 101.”
  • Reader: “I see the Writer is at 101. I can safely read 100.”

Physics:

  • Memory Barrier: Used instead of Locks (cost: ~10 ns vs ~5,000 ns for a contended lock).
  • Cache Locality: The array is contiguous in RAM, so the CPU prefetcher pulls the next order into L1 Cache before you even ask for it.

The Enemy: Cache Misses

Why std::map (a Red-Black Tree) is banned in HFT: it is a “Node-Based” container. Node A is at address 0x100; Node B is at 0x9000. To traverse the tree, the CPU must fetch 0x100, stall ~100 ns waiting on RAM, then fetch 0x9000 and stall again.

The Fix: Flat Vectors. We pre-allocate std::vector<Order>. Scanning a vector can be ~100x faster than walking a tree, even though the algorithm is O(N) vs O(log N), because Hardware Prefetching streams the next elements into cache ahead of use.
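Here is the flat layout sketched in Python, with a plain list standing in for std::vector (FlatBook and its names are illustrative). A price maps to an array index by arithmetic, so a lookup is one contiguous access instead of a chain of pointer hops:

class FlatBook:
    """Pre-allocated price ladder: one contiguous slot per price tick."""

    def __init__(self, min_price, max_price, tick=1):
        self.min_price = min_price
        self.tick = tick
        n_levels = (max_price - min_price) // tick + 1
        self.levels = [0] * n_levels  # resting quantity per level, allocated ONCE

    def _index(self, price):
        # Price -> slot is pure arithmetic: no tree traversal,
        # no cache-missing pointer chases.
        return (price - self.min_price) // self.tick

    def add(self, price, qty):
        self.levels[self._index(price)] += qty

    def quantity_at(self, price):
        return self.levels[self._index(price)]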


Code: The Disruptor Pattern

A conceptual implementation of a lock-free, single-producer/single-consumer ring buffer (Python for readability; the original LMAX Disruptor is a Java library).

class RingBuffer:
    """Single-producer / single-consumer (SPSC) ring buffer.

    Lock-free in the Disruptor sense: each cursor has exactly one
    writer, so no mutex is ever taken. (CPython's GIL serializes
    execution anyway; this illustrates the pattern, not the speed.)
    """

    def __init__(self, size=1024):
        # Real Disruptors use a power-of-two size so the modulo
        # below becomes a cheap bitwise AND: index & (size - 1).
        self.buffer = [None] * size
        self.size = size
        self.writer_cursor = 0
        self.reader_cursor = 0

    def write(self, data):
        next_slot = (self.writer_cursor + 1) % self.size
        if next_slot == self.reader_cursor:
            # One slot is sacrificed to distinguish "full" from "empty".
            raise BufferError("Buffer Full")

        self.buffer[self.writer_cursor] = data
        # Publish. In C++ this store would use memory_order_release,
        # guaranteeing the data write above becomes visible first.
        self.writer_cursor = next_slot

    def read(self):
        # In C++ this load would use memory_order_acquire.
        if self.reader_cursor == self.writer_cursor:
            return None  # Empty

        data = self.buffer[self.reader_cursor]
        self.reader_cursor = (self.reader_cursor + 1) % self.size
        return data
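A quick round-trip, as the Network Thread (writer) and Matching Thread (reader) would use it; a single thread is shown here for brevity:

rb = RingBuffer(size=8)
rb.write({"id": 1, "side": "BUY", "px": 100})
rb.write({"id": 2, "side": "SELL", "px": 101})

print(rb.read())  # {'id': 1, ...} -- FIFO: arrival order is preserved
print(rb.read())  # {'id': 2, ...}
print(rb.read())  # None -- buffer is empty again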

Practice Exercises

Exercise 1: Cache Miss Math (Beginner)

Scenario: L1 Cache hit = 1ns. RAM access = 100ns. Task: Algorithm A does 100 L1 hits. Algorithm B does 2 RAM accesses. Which is faster? (A = 100ns. B = 200ns. The “slower” algorithm wins if it fits in cache).

Exercise 2: Lock Contention (Intermediate)

Scenario: Two threads fight over a Mutex. Task: Why does the OS put the waiting thread to sleep? What happens during the resulting Context Switch? Why is this unacceptable in a matching engine? (Sleeping means >10 µs of added latency.)
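You can watch the sleep happen from Python (a sketch; the absolute numbers are illustrative, since CPython's own overhead dwarfs raw OS latencies, but the blocking behavior is the point):

import threading
import time

lock = threading.Lock()

def holder():
    with lock:
        time.sleep(0.001)  # hold the mutex for ~1 ms

t = threading.Thread(target=holder)
t.start()
time.sleep(0.0001)  # ensure the holder grabs the lock first

start = time.perf_counter()
with lock:  # this thread is put to sleep until the holder releases
    waited = time.perf_counter() - start
t.join()
print(f"Waited ~{waited * 1e6:.0f} us")  # roughly the 1 ms hold + wakeup latency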

Exercise 3: Sequence Numbers (Advanced)

Task: The Matching Engine crashes. How do you recover? Solution: Replay the Journal. Since the engine is deterministic, replaying the same input stream produces the exact same state.
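A toy illustration (Engine here is a stand-in, not a real matching engine): because state is a pure function of the input sequence, feeding the journal back into a fresh instance reproduces the pre-crash state exactly.

class Engine:
    """Deterministic toy engine: state is a pure function of its inputs."""

    def __init__(self):
        self.position = 0
        self.journal = []

    def apply(self, qty):
        self.journal.append(qty)  # journal the input, then apply it
        self.position += qty

primary = Engine()
for qty in [+100, -30, +7]:
    primary.apply(qty)

# Crash. Recovery: replay the journal into a fresh instance.
recovered = Engine()
for qty in primary.journal:
    recovered.apply(qty)

assert recovered.position == primary.position  # identical state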


Knowledge Check

  1. Why are Matching Engines single-threaded?
  2. What is a Ring Buffer?
  3. Why do HFTs avoid new / malloc at runtime?
  4. What is a “Memory Barrier”?
  5. Why is a flat array faster than a tree for small datasets?
Answers
  1. Determinism and Speed. No race conditions, no lock overhead.
  2. Circular Queue. A fixed-size array used to pass data between threads lock-free.
  3. OS Overhead. Allocation can call into the Kernel and trigger Page Faults, so its latency is unpredictable.
  4. CPU Instruction. It forces the CPU to flush write buffers, ensuring other cores see the data. Cheaper than a lock.
  5. Cache Locality. Arrays are contiguous; Trees are scattered in RAM.

Summary

  • Logic: Single-Threaded > Multi-Threaded.
  • Data: Arrays > Trees.
  • Communication: Ring Buffers > Queues.
