Infrastructure Audit

HFT Infrastructure Scorecard

Learn how institutional trading firms build low-latency systems. Each item shows the approximate latency it saves.


Physics & Network

The speed of light doesn't negotiate

These are physics constraints. No software optimization can overcome them.

Colocated with exchange/matching engine?

+120µs

Cloud = 2-20ms. Colo = sub-100µs.

Why it matters: Light travels ~300km per millisecond. If you're in AWS us-east-1 and the exchange is in Secaucus, NJ, you're paying a ~2ms round-trip tax on every order. That's 2000 microseconds where a colocated competitor can react first.

The math: At 10,000 trades/day, if you lose 1 tick on 10% of trades due to latency, and each tick is $0.10, that's $100/day, or $25,000 in lost alpha over a ~250-day trading year.

What top firms do: Firms like Citadel, Jump, and Two Sigma pay $100k+/month for colo space within 100 meters of the matching engine.

# No software fix exists
# CME: Aurora, IL or Secaucus, NJ
# NYSE: Mahwah, NJ
# Binance/FTX: AWS ap-northeast-1 (Tokyo)

Using kernel bypass (DPDK/RDMA/XDP)?

+80µs

Standard TCP/IP = 5+ kernel layers.

Why it matters: A normal network packet traverses: NIC → Driver → Kernel Space → TCP/IP Stack → Socket Buffer → User Space. Each transition costs 1-5µs due to context switches and memory copies.

The alternative: DPDK (Data Plane Development Kit) maps the NIC directly to user-space memory. Your application polls the NIC directly - zero kernel involvement. This reduces per-packet latency from ~15µs to ~1µs.

Trade-off: You lose the kernel's TCP implementation. You must implement your own protocol handling or use a library like Seastar.

# Load the vfio-pci userspace I/O driver
$ modprobe vfio-pci
# Unbind the NIC from the kernel and hand it to DPDK
$ dpdk-devbind --bind=vfio-pci 0000:01:00.0
# Run on cores 0-3 with 4 memory channels
$ ./dpdk-app -l 0-3 -n 4
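
For a feel of the programming model, here is a minimal busy-poll sketch (assumes the DPDK EAL and port 0 are already initialized; error handling omitted and the function name is illustrative):

// DPDK busy-poll loop: no interrupts, no syscalls, no kernel in the path
#include <rte_ethdev.h>
#include <rte_mbuf.h>
void rx_loop() {
    rte_mbuf* pkts[32];
    for (;;) {
        uint16_t n = rte_eth_rx_burst(0 /*port*/, 0 /*queue*/, pkts, 32);
        for (uint16_t i = 0; i < n; ++i) {
            // parse market data straight out of user-space packet memory
            rte_pktmbuf_free(pkts[i]);
        }
    }
}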

NIC IRQ affinity pinned to isolated cores?

+40µs

Unpinned = 10-50µs jitter spikes.

Why it matters: When a packet arrives, the NIC fires an interrupt (IRQ). By default, Linux can route this interrupt to any CPU. If your trading thread is on CPU 2 and the IRQ lands on CPU 5, the data must cross the CPU interconnect.

Worse: If the IRQ lands on the same CPU as your trading thread, it interrupts your critical path. Both scenarios add 10-50µs of jitter.

The fix: Pin NIC IRQs to dedicated cores that do nothing but handle interrupts. Keep trading threads on separate isolated cores.

# Find IRQ number
$ cat /proc/interrupts | grep eth0
# Pin IRQ 42 to CPU 1 (bitmask 0x2)
$ echo 2 > /proc/irq/42/smp_affinity

Hardware timestamping on NICs?

+15µs

Software timestamps = µs-level drift.

Why it matters: When you call `clock_gettime()`, you're asking the kernel for the current time. That costs tens of nanoseconds even on the vDSO fast path, 500ns+ when it falls back to a real syscall, and the reading drifts if the kernel is busy.

Hardware timestamps: Modern NICs (Intel X710, Mellanox ConnectX) can stamp packets at the nanosecond level directly in hardware. This is essential for:
- Proving execution time to regulators
- Detecting MEV timestamp manipulation
- Accurate latency measurement

Accuracy: Software timestamps: 1-10µs error. Hardware timestamps: <50ns error.

$ ethtool -T eth0
# Look for: hardware-transmit, hardware-receive
# Enable PTP:
$ ptp4l -i eth0 -m
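
On the receive path, Linux hands hardware timestamps to applications via SO_TIMESTAMPING. A minimal sketch, assuming `fd` is an open socket and the NIC/driver already have hardware timestamping enabled (e.g. via ptp4l):

// Request raw hardware RX timestamps (Linux SO_TIMESTAMPING)
#include <sys/socket.h>
#include <linux/net_tstamp.h>
void enable_hw_timestamps(int fd) {
    int flags = SOF_TIMESTAMPING_RX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
    // timestamps then arrive as SCM_TIMESTAMPING control messages via recvmsg()
}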

Architecture

Structure determines your ceiling

Your code architecture creates hard limits. Choose wrong, and no tuning will save you.

Single-threaded hot path (no locks)?

+60µs

Mutex = unpredictable tail latency.

Why it matters: A mutex lock in your critical path means threads can block each other. Even if contention is rare, when it happens, you see 50-100µs stalls. This destroys your P99 latency.

Real numbers: Uncontended mutex: ~20ns. Contended mutex: 10,000-100,000ns.

The pattern: Use a single-threaded event loop for order handling. Communicate with other threads via lock-free SPSC (Single-Producer Single-Consumer) queues.

// Lock-free SPSC ring buffer (C++): producer owns tail, consumer owns head
#include <atomic>
#include <cstddef>
template<typename T, std::size_t SIZE>
class SPSCQueue {
    std::atomic<std::size_t> head{0};  // advanced only by the consumer
    std::atomic<std::size_t> tail{0};  // advanced only by the producer
    T buffer[SIZE];
    // push(): write buffer[tail % SIZE], then tail.store(tail + 1, release)
    // pop():  if head != tail.load(acquire), read buffer[head % SIZE]
};

NUMA-aware memory allocation?

+35µs

Cross-socket = +30-100ns per access.

Why it matters: Modern servers have 2+ CPU sockets. Each socket has its own memory controller. Accessing memory "local" to your CPU: 70ns. Accessing memory on the other socket: 130ns.

The trap: Linux allocates memory from any NUMA node by default. Your trading thread on Socket 0 might be reading order book data from Socket 1's memory.

At scale: At +60ns per access, 1 billion cross-socket accesses add up to 60 seconds of accumulated stall time - a number a busy trading system can easily hit in a single day.

# Check NUMA topology
$ numactl --hardware
# Run pinned to node 0
$ numactl --membind=0 --cpunodebind=0 ./trading
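
The same placement can be enforced from code with libnuma (link with -lnuma); a sketch, assuming node 0 is where the trading thread runs:

// Allocate the order book on a specific NUMA node
#include <cstddef>
#include <numa.h>
void* alloc_book(std::size_t bytes) {
    return numa_alloc_onnode(bytes, 0);  // node 0, same as the trading thread
    // release later with numa_free(ptr, bytes)
}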

Pre-allocated memory pools?

+50µs

malloc() = 100µs+ stalls.

Why it matters: `malloc()` and `new` are not constant-time. They search free lists, may call `mmap()`, and can trigger page faults. Worst case: you wait for the kernel to find memory.

Measured: Average malloc: 50ns. But P99 can spike to 100,000ns when memory is fragmented.

The fix: Pre-allocate everything at startup. Use object pools or arena allocators. Your hot path should never call malloc.

// Arena (bump) allocator: one big buffer up front, alloc is pointer arithmetic
#include <cstddef>
#include <cstdlib>
class Arena {
    char* buffer;
    std::size_t capacity;
    std::size_t offset = 0;
public:
    explicit Arena(std::size_t cap)
        : buffer(static_cast<char*>(std::malloc(cap))), capacity(cap) {}
    ~Arena() { std::free(buffer); }
    void* alloc(std::size_t n) {
        n = (n + 15) & ~static_cast<std::size_t>(15);  // preserve 16-byte alignment
        if (offset + n > capacity) return nullptr;     // arena exhausted
        void* p = buffer + offset;
        offset += n;
        return p;
    }
};
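
For instance, a usage sketch (the `Order` struct is a hypothetical placeholder):

// Reserve once at startup, bump-allocate on the hot path
#include <new>
struct Order { long id; double px; int qty; };       // hypothetical order type
Arena arena(64 * 1024 * 1024);                       // 64 MiB reserved before the open
auto* o = new (arena.alloc(sizeof(Order))) Order{};  // pointer bump, no malloc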

Order gateway on same NUMA node as NIC?

+25µs

Cross-socket = QPI/UPI bus latency.

Why it matters: The NIC is physically connected to one CPU socket. If your gateway process runs on the other socket, every packet crosses the inter-socket link (Intel QPI/UPI).

Hidden cost: This adds ~60ns per packet each way. For a market data feed processing 1M messages/second, that's 60ms of extra latency per second.

How to check: Read the NIC's `numa_node` entry in sysfs (below) to see which socket it hangs off, then match your process affinity to that socket.

# Find NIC NUMA node
$ cat /sys/class/net/eth0/device/numa_node
# Should output 0 or 1
# Pin your process to that node (here: node 0)
$ numactl --cpunodebind=0 --membind=0 ./gateway

Software Config

The devil is in the kernel parameters

Linux defaults are optimized for throughput and power savings, not latency. You must override them.

CPU isolation (isolcpus + nohz_full)?

+25µs

Prevents OS from stealing cycles.

Why it matters: The Linux scheduler can migrate your thread to any CPU at any time. Even if it doesn't migrate, it can preempt your thread to run kernel tasks (RCU callbacks, timer ticks, etc.).

`isolcpus`: Tells Linux "never schedule anything on CPUs 2-7 unless explicitly asked."

`nohz_full`: Disables the kernel's timer tick (normally 250Hz) on those CPUs. No tick = no interruptions.

Combined effect: Your trading thread runs uninterrupted. P99 drops from 50µs to 5µs.

# /etc/default/grub:
GRUB_CMDLINE_LINUX="isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7"
$ grub2-mkconfig -o /boot/grub2/grub.cfg
$ reboot
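
Note that `isolcpus` only keeps the scheduler away; you still have to place your thread on the isolated core explicitly. A sketch using sched_setaffinity, matching the `isolcpus=2-7` range above:

// Pin the calling thread to an isolated core
#define _GNU_SOURCE
#include <sched.h>
void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                      // e.g. core 2, inside isolcpus=2-7
    sched_setaffinity(0, sizeof(set), &set);  // 0 = calling thread
}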

Huge pages (2MB or 1GB)?

+30µs

Reduces TLB misses 10-100x.

Why it matters: CPUs use a TLB (Translation Lookaside Buffer) to cache page table lookups. With 4KB pages, a 1GB dataset needs 262,144 pages. TLB only holds ~1,500 entries → constant misses.

With 2MB pages: Same 1GB dataset = 512 pages. Fits in TLB = zero misses.

Impact: Each TLB miss costs 10-100 cycles (5-50ns). For memory-intensive HFT (order books, tick databases), this adds up to 1-5µs per operation.

# Reserve 1024 huge pages (2GB)
$ echo 1024 > /proc/sys/vm/nr_hugepages
# Mount hugetlbfs
$ mount -t hugetlbfs none /mnt/huge
# Use in code: mmap with MAP_HUGETLB
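
In code, the explicit route is mmap with MAP_HUGETLB; a sketch (mmap returns MAP_FAILED if the reservation above is too small):

// Back a 1 GiB region with 2 MB huge pages
#include <sys/mman.h>
void* huge_region() {
    return mmap(nullptr, 1UL << 30, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}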

TCP_NODELAY and buffer tuning?

+20µs

Nagle's algorithm = 40ms delay.

Why it matters: By default, TCP uses Nagle's algorithm: small packets are held back until in-flight data is acknowledged, and combined with the peer's delayed ACKs a small send can stall for up to 40ms. This is great for throughput, terrible for latency.

TCP_NODELAY: Disables Nagle. Every `send()` goes out immediately.

Buffer tuning: Default socket buffer limits are a compromise. Raise the kernel maximums so bursts of market data are never dropped (a drop means a retransmit, which is catastrophic for latency), and keep application send queues shallow to avoid queuing delay.

// Disable Nagle
int one = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
// Tune kernel buffers
$ sysctl -w net.core.rmem_max=16777216
$ sysctl -w net.core.wmem_max=16777216

C-states and P-states disabled?

+15µs

Power savings = 2-100µs wake latency.

Why it matters: Modern CPUs save power by entering "C-states" (sleep states) when idle. C1: 2µs wake-up. C3: 50µs. C6: 100µs+.

The trap: Your trading thread is waiting for a packet. CPU goes to C3. Packet arrives. You pay 50µs to wake up before processing.

P-states: CPU also varies frequency for power savings. Ramping from 2GHz to 3.5GHz takes 10-50µs.

# BIOS: Disable C1E, C3, C6, C7
# Kernel boot params:
intel_idle.max_cstate=0 processor.max_cstate=0
# Lock the frequency governor (P-states) at performance:
$ cpupower frequency-set -g performance
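
The software-side complement is to never let the hot core idle at all: busy-poll instead of blocking. A sketch, where `data_ready` is a hypothetical flag set by the feed handler:

// A core that never blocks never enters a C-state
#include <atomic>
#include <immintrin.h>
std::atomic<bool> data_ready{false};
void wait_for_packet() {
    while (!data_ready.load(std::memory_order_acquire))
        _mm_pause();  // spin-wait hint; core stays awake at full frequency
}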

Monitoring

You can't fix what you don't measure

Without proper measurement, you're optimizing blind. Averages lie. Tails tell the truth.

P99/P999 latency tracking?

Averages hide the trades you lose.

Why it matters: Your average latency could be 10µs, but if P99 is 500µs, you're losing 1% of your trades to a competitor with consistent 50µs.

The math: In a race to the exchange, you only need to be slow once to lose that trade. If you trade 100 times/second, P99 = 1 loss/second.

What to track:
- P50 (median): your typical case
- P99: once per 100 events
- P999: once per 1,000 events; often reveals GC, page faults, or scheduler issues

// HDR Histogram (recommended)
#include <hdr/hdr_histogram.h>
struct hdr_histogram* hist;
hdr_init(1, 1000000, 3, &hist);      // track 1µs-1s at 3 significant figures
hdr_record_value(hist, latency_us);  // record every measured latency
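
Reading the tail back out of the same histogram is then one call per percentile:

// Query tail percentiles after (or during) the session
int64_t p99  = hdr_value_at_percentile(hist, 99.0);
int64_t p999 = hdr_value_at_percentile(hist, 99.9);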

Real-time latency dashboards?

If you can't see it, you can't fix it.

Why it matters: A latency regression at 2:47 PM on Tuesday will be invisible if you only look at daily averages. You need real-time visibility to correlate spikes with events.

What to look for:
- Latency heatmaps (not line charts)
- Histogram distributions over time
- Correlation with system events (GC, cron jobs, etc.)

Alerting: Set up P99 alerts. If median is 10µs, alert at P99 > 50µs.

# Prometheus + Grafana setup
$ cat /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: trading
    static_configs:
      - targets: ['localhost:9090']

This scorecard is based on infrastructure patterns from Akuna Capital, Gemini, and other institutional trading firms.