Infrastructure
The Sub-50µs Cloud Lie: How We Actually Achieved Deterministic Latency on AWS
A deep dive into why cloud providers' latency claims are misleading, and the exact kernel bypass techniques we used to achieve deterministic sub-50µs RTT on c6i.metal instances.
At a top-tier crypto exchange, we inherited an AWS deployment averaging 180µs RTT. The vendor promised “sub-millisecond.”
After 3 months of kernel surgery, we hit 47µs P99, a 74% reduction. The vendor’s number wasn’t a lie; it was just measured in a vacuum. Production is not a vacuum.
This post documents the exact techniques we used to eliminate jitter on generic cloud hardware. If you’re running latency-sensitive workloads on AWS, you’re probably leaving 100µs on the table.
1. The Physics of Virtualization Jitter
Cloud latency jitter comes from three sources, all invisible to application code:
Source 1: Hypervisor Scheduling (The VMExit Tax)
Every privileged instruction (I/O, timer access) triggers a VMExit. The CPU traps to the hypervisor, context switches, and returns. On KVM, a single VMExit costs ~1µs.
- Source: Intel SDM Vol 3C, Chapter 25 - “VM Entries and VM Exits”
- Measurement: perf kvm stat shows exit counts (a rough in-guest sketch follows below).
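If you want to feel this tax from inside a virtualized guest, there is a crude trick: CPUID is an unconditional VM exit under VMX, so timing it in a loop approximates one exit/entry round trip. This is a sketch, not our production instrumentation; perf kvm stat from the host remains the authoritative view.
// Rough sketch: time CPUID from inside a guest to approximate one VM exit.
// Run the same loop on bare metal to get the no-hypervisor baseline.
#include <stdio.h>
#include <cpuid.h>
#include <x86intrin.h>
int main(void)
{
    unsigned a, b, c, d;
    const int iters = 100000;
    unsigned long long start = __rdtsc();
    for (int i = 0; i < iters; i++)
        __get_cpuid(0, &a, &b, &c, &d);   // forces a VM exit per iteration on a VMX guest
    unsigned long long cycles = __rdtsc() - start;
    printf("avg cycles per CPUID: %llu\n", cycles / iters);
    return 0;
}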
Source 2: Noisy Neighbors (The Steal Time Tax)
Even on “dedicated” instances, the hypervisor’s management plane runs on your cores. This is invisible unless you check mpstat for %steal.
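mpstat derives %steal from the eighth field of the cpu line in /proc/stat. If you want the same check inside your own watchdog, here is a minimal sketch of that read:
// Sketch: read the aggregate steal counter from /proc/stat, the same source
// mpstat uses. Steal is the 8th field of the "cpu" line, in clock ticks since
// boot; sample it twice over an interval to turn it into a rate.
#include <stdio.h>
int main(void)
{
    unsigned long long user, nice, sys, idle, iowait, irq, softirq, steal;
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return 1;
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &sys, &idle, &iowait, &irq, &softirq, &steal) == 8)
        printf("steal ticks since boot: %llu\n", steal);
    fclose(f);
    return 0;
}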
Source 3: NIC Virtualization (The VNIC Tax)
Standard ENAs route packets through the hypervisor’s virtual switch. Each packet incurs a copy and a context switch.
The Tax: Each hop adds ~20-40µs of non-deterministic delay.
2. The Decision Matrix
| Approach | P99 Latency | Complexity | Verdict |
|---|---|---|---|
| A. Standard ENA (Default) | ~180µs | Low | Baseline. Unacceptable for HFT. |
| B. ENA Express (AWS Feature) | ~100µs | Low | Marginal improvement. Still hypervisor-bound. |
| C. c6i.metal + DPDK | ~47µs | High | Selected. Full kernel bypass. |
Why Metal?
On .metal instances, the Nitro card presents the NIC directly via SR-IOV. There is no hypervisor vSwitch. The NIC is your hardware.
- Source: AWS Nitro System Whitepaper
3. The Kill: DPDK on Nitro
We bypass the kernel entirely using DPDK (Data Plane Development Kit).
Step 1: Bind the NIC to DPDK
# Load the vfio-pci driver
sudo modprobe vfio-pci
# Unbind from the kernel driver
sudo dpdk-devbind.py -u 0000:00:06.0
# Bind to DPDK
sudo dpdk-devbind.py -b vfio-pci 0000:00:06.0
Step 2: Poll in Userspace
Instead of waiting for interrupts, we poll the NIC’s RX ring in a tight loop.
// DPDK pseudo-code: the RX hot loop
struct rte_mbuf *pkts[BURST_SIZE];
uint16_t nb_rx;
while (1) {
    // Drain up to BURST_SIZE packets from the RX ring; returns immediately if empty
    nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);
    if (nb_rx > 0) {
        process_packets(pkts, nb_rx);   // application logic; responsible for freeing the mbufs
    }
}
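The loop assumes the EAL, a mbuf pool, and the port are already up. A minimal bring-up sketch follows; queue and pool sizes are illustrative, not our production values.
// Minimal DPDK bring-up assumed by the poll loop above (sizes illustrative).
#include <stdint.h>
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>
#define RX_DESCS   1024
#define NUM_MBUFS  8191
#define MBUF_CACHE 250
static uint16_t port_id = 0, queue_id = 0;
static void init_port(int argc, char **argv)
{
    struct rte_eth_conf port_conf = {0};   // driver defaults are fine for a sketch
    struct rte_mempool *pool;
    if (rte_eal_init(argc, argv) < 0)
        exit(EXIT_FAILURE);                // EAL init failed
    // Buffer pool the NIC DMAs received packets into
    pool = rte_pktmbuf_pool_create("rx_pool", NUM_MBUFS, MBUF_CACHE, 0,
                                   RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL)
        exit(EXIT_FAILURE);
    // One RX queue and one TX queue, then start the port
    if (rte_eth_dev_configure(port_id, 1, 1, &port_conf) < 0 ||
        rte_eth_rx_queue_setup(port_id, queue_id, RX_DESCS, rte_socket_id(), NULL, pool) < 0 ||
        rte_eth_tx_queue_setup(port_id, 0, RX_DESCS, rte_socket_id(), NULL) < 0 ||
        rte_eth_dev_start(port_id) < 0)
        exit(EXIT_FAILURE);
}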
Verification:
- Before: ping -c 100 <target> shows P99 of ~200µs.
- After: Custom DPDK rte_rdtsc() loop shows P99 of ~47µs.
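For reference, converting TSC ticks to microseconds is trivial. A sketch, where send_and_wait_for_reply() is a placeholder for the harness that transmits one request and busy-polls the RX ring until the reply arrives:
// Sketch of the TSC-based timing behind the numbers above.
#include <stdint.h>
#include <rte_cycles.h>
extern void send_and_wait_for_reply(void);   // hypothetical harness hook
double measure_rtt_us(void)
{
    uint64_t start = rte_rdtsc();
    send_and_wait_for_reply();
    uint64_t ticks = rte_rdtsc() - start;
    return (double)ticks * 1e6 / rte_get_tsc_hz();   // TSC ticks -> microseconds
}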
4. The Tool: Verifying Your Kernel State
Before you touch DPDK, ensure your kernel isn’t fighting you.
pip install latency-audit && latency-audit
latency-audit checks for common misconfigurations like irqbalance running (which moves your interrupts), nohz_full not set (which causes timer interrupts), and THP enabled (which causes memory stalls).
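If you would rather script a couple of these checks by hand, the idea looks like this (illustrative only, not the tool’s implementation):
// Hand-rolled versions of two checks: is nohz_full on the kernel cmdline,
// and is transparent hugepage support disabled?
#include <stdio.h>
#include <string.h>
static int file_contains(const char *path, const char *needle)
{
    char buf[4096] = {0};
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    buf[fread(buf, 1, sizeof(buf) - 1, f)] = '\0';
    fclose(f);
    return strstr(buf, needle) != NULL;
}
int main(void)
{
    if (file_contains("/proc/cmdline", "nohz_full=") != 1)
        puts("WARN: nohz_full not set; timer ticks will hit your polling cores");
    if (file_contains("/sys/kernel/mm/transparent_hugepage/enabled", "[never]") != 1)
        puts("WARN: THP not set to never; khugepaged can stall the hot path");
    return 0;
}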
5. Systems Thinking: The Trade-offs
- Observability Loss: DPDK packets don’t appear in tcpdump or iptables. You need custom tooling.
- CPU Cost: Polling burns 100% of a dedicated core, which works out to roughly $17,520/year per core.
- Operational Complexity: Your team must now understand userspace networking. This is a hiring filter.
6. The Philosophy
The cloud is not slow. Your assumptions about the cloud are slow.
Every abstraction has a cost. The kernel’s networking stack was designed for generality, not for sub-100µs latency. When you demand determinism, you must pay the complexity tax of bypassing the abstraction.
The question isn’t “Can we run HFT on AWS?” It’s “Are we willing to operate at the metal level while paying cloud prices?”
For most, the answer is no. For us, the $2M/year in improved fill rates made it obvious.
Up Next in Linux Infrastructure Deep Dives
PTP or Die: Hardware Timestamping for Regulatory-Grade Time Sync
Why NTP is fundamentally broken for HFT compliance, and how we implemented IEEE 1588 PTPv2 with hardware timestamping to achieve sub-100ns accuracy on Solarflare NICs.