Infrastructure Audit

HFT Infrastructure Scorecard

Learn how institutional trading firms build low-latency systems. Each item shows the approximate latency it saves.


Physics & Network

The speed of light doesn't negotiate

These are physics constraints. No software optimization can overcome them.

Colocated with exchange/matching engine?

+120µs

Cloud = 2-20ms. Colo = sub-100µs.

Why it matters: Light travels ~300km per millisecond. If you're in AWS us-east-1 and the exchange is in Secaucus, NJ, you're paying a ~2ms round-trip tax on every order. That's 2000 microseconds where a colocated competitor can react first.

The math: At 10,000 trades/day, if you lose 1 tick on 10% of trades due to latency, and each tick is $0.10, that's $100/day, or $25,000 in lost alpha over a ~250-day trading year.

What top firms do: Firms like Citadel, Jump, and Two Sigma pay $100k+/month for colo space within 100 meters of the matching engine.

# No software fix exists
# CME: Aurora, IL or Secaucus, NJ
# NYSE: Mahwah, NJ
# Binance/FTX: AWS ap-northeast-1 (Tokyo)

Using kernel bypass (DPDK/RDMA/XDP)?

+80µs

Standard TCP/IP = 5+ kernel layers.

Why it matters: A normal network packet traverses: NIC → Driver → Kernel Space → TCP/IP Stack → Socket Buffer → User Space. Each transition costs 1-5µs due to context switches and memory copies.

The alternative: DPDK (Data Plane Development Kit) maps the NIC directly to user-space memory. Your application polls the NIC directly - zero kernel involvement. This reduces per-packet latency from ~15µs to ~1µs.

Trade-off: You lose the kernel's TCP implementation. You must implement your own protocol handling or use a library like Seastar.

# Load the vfio-pci userspace I/O driver
$ modprobe vfio-pci
# Unbind the NIC from the kernel and hand it to DPDK
$ dpdk-devbind --bind=vfio-pci 0000:01:00.0
# Run on cores 0-3 with 4 memory channels
$ ./dpdk-app -l 0-3 -n 4
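
For a feel of the programming model, here is a minimal busy-poll sketch (assumes the DPDK EAL and port 0 are already initialized; error handling omitted and the function name is illustrative):

// DPDK busy-poll loop: no interrupts, no syscalls, no kernel in the path
#include <rte_ethdev.h>
#include <rte_mbuf.h>
void rx_loop() {
    rte_mbuf* pkts[32];
    for (;;) {
        uint16_t n = rte_eth_rx_burst(0 /*port*/, 0 /*queue*/, pkts, 32);
        for (uint16_t i = 0; i < n; ++i) {
            // parse market data straight out of user-space packet memory
            rte_pktmbuf_free(pkts[i]);
        }
    }
}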

NIC IRQ affinity pinned to isolated cores?

+40µs

Unpinned = 10-50µs jitter spikes.

Why it matters: When a packet arrives, the NIC fires an interrupt (IRQ). By default, Linux can route this interrupt to any CPU. If your trading thread is on CPU 2 and the IRQ lands on CPU 5, the data must cross the CPU interconnect.

Worse: If the IRQ lands on the same CPU as your trading thread, it interrupts your critical path. Both scenarios add 10-50µs of jitter.

The fix: Pin NIC IRQs to dedicated cores that do nothing but handle interrupts. Keep trading threads on separate isolated cores.

# Find IRQ number
$ cat /proc/interrupts | grep eth0
# Pin IRQ 42 to CPU 1 (bitmask 0x2)
$ echo 2 > /proc/irq/42/smp_affinity

Hardware timestamping on NICs?

+15µs

Software timestamps = µs-level drift.

Why it matters: When you call `clock_gettime()`, you're asking the kernel for the current time. That costs tens of nanoseconds even on the vDSO fast path, 500ns+ when it falls back to a real syscall, and the reading drifts if the kernel is busy.

Hardware timestamps: Modern NICs (Intel X710, Mellanox ConnectX) can stamp packets at the nanosecond level directly in hardware. This is essential for:
- Proving execution time to regulators
- Detecting MEV timestamp manipulation
- Accurate latency measurement

Accuracy: Software timestamps: 1-10µs error. Hardware timestamps: <50ns error.

$ ethtool -T eth0
# Look for: hardware-transmit, hardware-receive
# Enable PTP:
$ ptp4l -i eth0 -m
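
On the receive path, Linux hands hardware timestamps to applications via SO_TIMESTAMPING. A minimal sketch, assuming `fd` is an open socket and the NIC/driver already have hardware timestamping enabled (e.g. via ptp4l):

// Request raw hardware RX timestamps (Linux SO_TIMESTAMPING)
#include <sys/socket.h>
#include <linux/net_tstamp.h>
void enable_hw_timestamps(int fd) {
    int flags = SOF_TIMESTAMPING_RX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
    // timestamps then arrive as SCM_TIMESTAMPING control messages via recvmsg()
}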

Architecture

Structure determines your ceiling

Your code architecture creates hard limits. Choose wrong, and no tuning will save you.

Single-threaded hot path (no locks)?

+60µs

Mutex = unpredictable tail latency.

Why it matters: A mutex lock in your critical path means threads can block each other. Even if contention is rare, when it happens, you see 50-100µs stalls. This destroys your P99 latency.

Real numbers: Uncontended mutex: ~20ns. Contended mutex: 10,000-100,000ns.

The pattern: Use a single-threaded event loop for order handling. Communicate with other threads via lock-free SPSC (Single-Producer Single-Consumer) queues.

// Lock-free SPSC ring buffer (C++): producer owns tail, consumer owns head
#include <atomic>
#include <cstddef>
template<typename T, std::size_t SIZE>
class SPSCQueue {
    std::atomic<std::size_t> head{0};  // advanced only by the consumer
    std::atomic<std::size_t> tail{0};  // advanced only by the producer
    T buffer[SIZE];
    // push(): write buffer[tail % SIZE], then tail.store(tail + 1, release)
    // pop():  if head != tail.load(acquire), read buffer[head % SIZE]
};

NUMA-aware memory allocation?

+35µs

Cross-socket = +30-100ns per access.

Why it matters: Modern servers have 2+ CPU sockets. Each socket has its own memory controller. Accessing memory "local" to your CPU: 70ns. Accessing memory on the other socket: 130ns.

The trap: Linux allocates memory from any NUMA node by default. Your trading thread on Socket 0 might be reading order book data from Socket 1's memory.

At scale: At +60ns per access, 1 billion cross-socket accesses add up to 60 seconds of accumulated stall time - a number a busy trading system can easily hit in a single day.

# Check NUMA topology
$ numactl --hardware
# Run pinned to node 0
$ numactl --membind=0 --cpunodebind=0 ./trading
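
The same placement can be enforced from code with libnuma (link with -lnuma); a sketch, assuming node 0 is where the trading thread runs:

// Allocate the order book on a specific NUMA node
#include <cstddef>
#include <numa.h>
void* alloc_book(std::size_t bytes) {
    return numa_alloc_onnode(bytes, 0);  // node 0, same as the trading thread
    // release later with numa_free(ptr, bytes)
}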

Pre-allocated memory pools?

+50µs

malloc() = 100µs+ stalls.

Why it matters: `malloc()` and `new` are not constant-time. They search free lists, may call `mmap()`, and can trigger page faults. Worst case: you wait for the kernel to find memory.

Measured: Average malloc: 50ns. But P99 can spike to 100,000ns when memory is fragmented.

The fix: Pre-allocate everything at startup. Use object pools or arena allocators. Your hot path should never call malloc.

// Arena (bump) allocator: one big buffer up front, alloc is pointer arithmetic
#include <cstddef>
#include <cstdlib>
class Arena {
    char* buffer;
    std::size_t capacity;
    std::size_t offset = 0;
public:
    explicit Arena(std::size_t cap)
        : buffer(static_cast<char*>(std::malloc(cap))), capacity(cap) {}
    ~Arena() { std::free(buffer); }
    void* alloc(std::size_t n) {
        n = (n + 15) & ~static_cast<std::size_t>(15);  // preserve 16-byte alignment
        if (offset + n > capacity) return nullptr;     // arena exhausted
        void* p = buffer + offset;
        offset += n;
        return p;
    }
};
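
For instance, a usage sketch (the `Order` struct is a hypothetical placeholder):

// Reserve once at startup, bump-allocate on the hot path
#include <new>
struct Order { long id; double px; int qty; };       // hypothetical order type
Arena arena(64 * 1024 * 1024);                       // 64 MiB reserved before the open
auto* o = new (arena.alloc(sizeof(Order))) Order{};  // pointer bump, no malloc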

Order gateway on same NUMA node as NIC?

+25µs

Cross-socket = QPI/UPI bus latency.

Why it matters: The NIC is physically connected to one CPU socket. If your gateway process runs on the other socket, every packet crosses the inter-socket link (Intel QPI/UPI).

Hidden cost: This adds ~60ns per packet each way. For a market data feed processing 1M messages/second, that's 60ms of extra latency per second.

How to check: Read the NIC's `numa_node` entry in sysfs (below) to see which socket it hangs off, then match your process affinity to that socket.

# Find NIC NUMA node
$ cat /sys/class/net/eth0/device/numa_node
# Should output 0 or 1
# Pin your process to that node (here: node 0)
$ numactl --cpunodebind=0 --membind=0 ./gateway

Software Config

The devil is in the kernel parameters

Linux defaults are optimized for throughput and power savings, not latency. You must override them.

CPU isolation (isolcpus + nohz_full)?

+25µs

Prevents OS from stealing cycles.

Why it matters: The Linux scheduler can migrate your thread to any CPU at any time. Even if it doesn't migrate, it can preempt your thread to run kernel tasks (RCU callbacks, timer ticks, etc.).

`isolcpus`: Tells Linux "never schedule anything on CPUs 2-7 unless explicitly asked."

`nohz_full`: Disables the kernel's timer tick (normally 250Hz) on those CPUs. No tick = no interruptions.

Combined effect: Your trading thread runs uninterrupted. P99 drops from 50µs to 5µs.

# /etc/default/grub:
GRUB_CMDLINE_LINUX="isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7"
$ grub2-mkconfig -o /boot/grub2/grub.cfg
$ reboot
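
Note that `isolcpus` only keeps the scheduler away; you still have to place your thread on the isolated core explicitly. A sketch using sched_setaffinity, matching the `isolcpus=2-7` range above:

// Pin the calling thread to an isolated core
#define _GNU_SOURCE
#include <sched.h>
void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                      // e.g. core 2, inside isolcpus=2-7
    sched_setaffinity(0, sizeof(set), &set);  // 0 = calling thread
}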

Huge pages (2MB or 1GB)?

+30µs

Reduces TLB misses 10-100x.

Why it matters: CPUs use a TLB (Translation Lookaside Buffer) to cache page table lookups. With 4KB pages, a 1GB dataset needs 262,144 pages. TLB only holds ~1,500 entries → constant misses.

With 2MB pages: Same 1GB dataset = 512 pages. Fits in TLB = zero misses.

Impact: Each TLB miss costs 10-100 cycles (5-50ns). For memory-intensive HFT (order books, tick databases), this adds up to 1-5µs per operation.

# Reserve 1024 huge pages (2GB)
$ echo 1024 > /proc/sys/vm/nr_hugepages
# Mount hugetlbfs
$ mount -t hugetlbfs none /mnt/huge
# Use in code: mmap with MAP_HUGETLB
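
In code, the explicit route is mmap with MAP_HUGETLB; a sketch (mmap returns MAP_FAILED if the reservation above is too small):

// Back a 1 GiB region with 2 MB huge pages
#include <sys/mman.h>
void* huge_region() {
    return mmap(nullptr, 1UL << 30, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}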

TCP_NODELAY and buffer tuning?

+20µs

Nagle's algorithm = 40ms delay.

Why it matters: By default, TCP uses Nagle's algorithm: small packets are held back until in-flight data is acknowledged, and combined with the peer's delayed ACKs a small send can stall for up to 40ms. This is great for throughput, terrible for latency.

TCP_NODELAY: Disables Nagle. Every `send()` goes out immediately.

Buffer tuning: Default socket buffer limits are a compromise. Raise the kernel maximums so bursts of market data are never dropped (a drop means a retransmit, which is catastrophic for latency), and keep application send queues shallow to avoid queuing delay.

// Disable Nagle
int one = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
// Tune kernel buffers
$ sysctl -w net.core.rmem_max=16777216
$ sysctl -w net.core.wmem_max=16777216

C-states and P-states disabled?

+15µs

Power savings = 2-100µs wake latency.

Why it matters: Modern CPUs save power by entering "C-states" (sleep states) when idle. C1: 2µs wake-up. C3: 50µs. C6: 100µs+.

The trap: Your trading thread is waiting for a packet. CPU goes to C3. Packet arrives. You pay 50µs to wake up before processing.

P-states: CPU also varies frequency for power savings. Ramping from 2GHz to 3.5GHz takes 10-50µs.

# BIOS: Disable C1E, C3, C6, C7
# Kernel boot params:
intel_idle.max_cstate=0 processor.max_cstate=0
# Lock the frequency governor (P-states) at performance:
$ cpupower frequency-set -g performance
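
The software-side complement is to never let the hot core idle at all: busy-poll instead of blocking. A sketch, where `data_ready` is a hypothetical flag set by the feed handler:

// A core that never blocks never enters a C-state
#include <atomic>
#include <immintrin.h>
std::atomic<bool> data_ready{false};
void wait_for_packet() {
    while (!data_ready.load(std::memory_order_acquire))
        _mm_pause();  // spin-wait hint; core stays awake at full frequency
}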

Monitoring

You can't fix what you don't measure

Without proper measurement, you're optimizing blind. Averages lie. Tails tell the truth.

P99/P999 latency tracking?

Averages hide the trades you lose.

Why it matters: Your average latency could be 10µs, but if P99 is 500µs, you're losing 1% of your trades to a competitor with consistent 50µs.

The math: In a race to the exchange, you only need to be slow once to lose that trade. If you trade 100 times/second, P99 = 1 loss/second.

What to track:
- P50 (median): your typical case
- P99: once per 100 events
- P999: once per 1,000 events; often reveals GC, page faults, or scheduler issues

// HDR Histogram (recommended)
#include <hdr/hdr_histogram.h>
struct hdr_histogram* hist;
hdr_init(1, 1000000, 3, &hist);      // track 1µs-1s at 3 significant figures
hdr_record_value(hist, latency_us);  // record every measured latency
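
Reading the tail back out of the same histogram is then one call per percentile:

// Query tail percentiles after (or during) the session
int64_t p99  = hdr_value_at_percentile(hist, 99.0);
int64_t p999 = hdr_value_at_percentile(hist, 99.9);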

Real-time latency dashboards?

If you can't see it, you can't fix it.

Why it matters: A latency regression at 2:47 PM on Tuesday will be invisible if you only look at daily averages. You need real-time visibility to correlate spikes with events.

What to look for:
- Latency heatmaps (not line charts)
- Histogram distributions over time
- Correlation with system events (GC, cron jobs, etc.)

Alerting: Set up P99 alerts. If median is 10µs, alert at P99 > 50µs.

# Prometheus + Grafana setup
$ cat /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: trading
    static_configs:
      - targets: ['localhost:9090']

This scorecard is based on infrastructure patterns from Akuna Capital, Gemini, and other institutional trading firms.