Infrastructure
eBPF Profiling: Nanoseconds Without Adding Any
Deep dive into eBPF, bpftrace, and kernel tracing. How to measure latency at nanosecond precision without the observer effect.
You cannot debug a latency spike if your debugger causes the spike.
Traditional profiling tools add significant overhead: strace costs 100-500µs per syscall, perf sampling uses 5-20% CPU. In a trading system processing 100K messages/second, these tools destroy the behavior you’re trying to measure.
eBPF changes this. Run sandboxed programs in the kernel, measure at nanosecond precision, add nearly zero overhead.
The Problem {#the-problem}
Traditional tools and their costs:
| Tool | Overhead | Why |
|---|---|---|
| strace | 100-500µs/syscall | ptrace stops process, copies data |
| ltrace | 200-800µs/call | Library interposition |
| perf (sampling) | 5-20% CPU | Interrupt-based sampling |
| gdb | Process stopped | Breakpoint traps |
The observer effect: The measurement changes the behavior. At trading speeds, this makes traditional tools useless for production debugging.
eBPF overhead: 100-500 nanoseconds per probe. 1000x less than strace.
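You can sanity-check that figure on your own hardware. A rough sketch, using dd as a syscall generator; the numbers are illustrative, not a rigorous benchmark:
# Baseline: ~2M syscalls (1M reads + 1M writes)
time dd if=/dev/zero of=/dev/null bs=1 count=1000000
# In a second terminal, attach a probe that fires on every syscall entry:
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @calls = count(); }'
# Back in the first terminal, repeat the timing while the probe is attached:
time dd if=/dev/zero of=/dev/null bs=1 count=1000000
# (extra wall time) / (~2M syscalls) ≈ per-probe overhead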
For what to measure, see Monitoring Metrics. For kernel tuning that affects what you’ll observe, see the deep dive series linked in the Reading Path at the end of this post.
Background: What Is eBPF? {#background}
The Architecture
eBPF (extended Berkeley Packet Filter) runs sandboxed programs in the Linux kernel (kernel source: kernel/bpf/).
User space program (bpftrace, bcc tool)
↓
BPF bytecode compilation
↓
Kernel verifier (safety checks)
↓
JIT compilation to native code
↓
Attached to kernel hook (kprobe, tracepoint, etc.)
↓
Runs on every event, collects data
↓
Maps (hash tables, arrays) export to user space
Why It’s Fast
- In-kernel execution: No context switches to user space
- JIT compiled: Runs at near-native speed
- Verified: Can’t crash the kernel
- No locks: Lock-free data structures for metrics
Available Hooks
| Hook Type | What It Traces | Example |
|---|---|---|
| kprobe | Function entry | kprobe:tcp_sendmsg |
| kretprobe | Function return | kretprobe:tcp_sendmsg |
| tracepoint | Static kernel events | tracepoint:sched:sched_switch |
| uprobe | User-space function entry | uprobe:/bin/bash:readline |
| USDT | User-defined traces | Application-specific |
Citation: the eBPF architecture is documented in the kernel source under Documentation/bpf/.
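To see which of these hooks exist on a given box, bpftrace can list probes without attaching anything. The libc path below is the Debian/Ubuntu location and is an assumption:
# Enumerate available probes by pattern
sudo bpftrace -l 'tracepoint:raw_syscalls:*'
sudo bpftrace -l 'kprobe:tcp_*' | head
sudo bpftrace -l 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc'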
Tool 1: bpftrace One-Liners {#bpftrace}
Installation
# Ubuntu/Debian
sudo apt install bpftrace
# Amazon Linux 2023
sudo dnf install bpftrace
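A quick smoke test confirms the kernel can verify, JIT, and attach a program:
sudo bpftrace -e 'BEGIN { printf("bpftrace is working\n"); exit(); }'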
Syscall Latency Histogram
# Distribution of syscall latencies (microseconds)
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
@usecs = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
Output:
@usecs:
[0] 12453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 4521 |@@@@@@@@@@@@@@@@@ |
[2, 4) 1234 |@@@@@ |
[4, 8) 342 |@ |
[8, 16) 45 | |
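The one-liner above is system-wide. To isolate a single process, add a predicate on comm or pid; the process name trading-engine below is a placeholder (comm matches the kernel’s 16-byte task name):
# Same histogram, restricted to one process by name
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter /comm == "trading-engine"/ { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
@usecs = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'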
Per-Syscall Latency
# Which syscalls are slow? Histograms in microseconds, keyed by syscall number
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
@latency[args->id] = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
Tool 2: Network Tracing {#network}
TCP Send Latency
# How long does tcp_sendmsg take in kernel?
sudo bpftrace -e '
kprobe:tcp_sendmsg { @start[tid] = nsecs; }
kretprobe:tcp_sendmsg /@start[tid]/ {
@tcp_send_ns = hist(nsecs - @start[tid]);
delete(@start[tid]);
}'
Packet Receive Path
# Time from network-stack entry (netif_receive_skb) to skb free, approximating the receive path
sudo bpftrace -e '
tracepoint:net:netif_receive_skb { @rx_start[args->skbaddr] = nsecs; }
tracepoint:net:consume_skb /@rx_start[args->skbaddr]/ {
@rx_time_ns = hist(nsecs - @rx_start[args->skbaddr]);
delete(@rx_start[args->skbaddr]);
}'
Softirq Time
# Network RX softirq processing time (vec 3 == NET_RX_SOFTIRQ)
sudo bpftrace -e '
tracepoint:irq:softirq_entry /args->vec == 3/ { @start[tid] = nsecs; }
tracepoint:irq:softirq_exit /args->vec == 3 && @start[tid]/ {
@net_softirq_ns = hist(nsecs - @start[tid]);
delete(@start[tid]);
}'
For network tuning that affects these measurements, see Network Deep Dive.
Tool 3: Scheduler Latency {#scheduler}
Run Queue Latency
# How long do threads wait for CPU?
sudo bpftrace -e '
tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new { @qtime[args->pid] = nsecs; }
tracepoint:sched:sched_switch {
if (args->prev_state == 0) { @qtime[args->prev_pid] = nsecs; } // preempted while runnable: back on the run queue
if (@qtime[args->next_pid]) {
@runq_latency_us = hist((nsecs - @qtime[args->next_pid]) / 1000);
delete(@qtime[args->next_pid]);
}
}'
Interpretation:
- Spikes at 4ms = timer tick preemption (CONFIG_HZ=250; see the check below)
- Long tail = CPU contention
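The 4ms figure follows from CONFIG_HZ=250: a 250 Hz tick is one tick every 4ms. To confirm the tick rate on the box you are measuring (assuming a distro that ships its kernel config in /boot):
# Check the kernel tick rate
grep 'CONFIG_HZ=' /boot/config-$(uname -r)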
Off-CPU Time
# Why is my thread not running?
sudo bpftrace -e '
tracepoint:sched:sched_switch {
@off_start[args->prev_pid] = nsecs;
if (@off_start[args->next_pid]) {
@off_cpu_us[args->next_comm] = hist((nsecs - @off_start[args->next_pid]) / 1000);
delete(@off_start[args->next_pid]);
}
}'
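The histogram tells you how long threads were off-CPU, not why. To see why, aggregate blocking switches by the kernel stack at switch-out; a sketch, where prev_state != 0 filters out preemption and keeps only voluntary blocking:
# Count blocking context switches by kernel stack and process name
sudo bpftrace -e '
tracepoint:sched:sched_switch /args->prev_state != 0/ {
@blocked[kstack, args->prev_comm] = count();
}'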
For scheduler tuning, see CPU Deep Dive.
Tool 4: Memory Allocation {#memory}
Malloc Latency
# How long does malloc take?
sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
@start[tid] = nsecs;
@size[tid] = arg0;
}
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /@start[tid]/ {
@malloc_ns[@size[tid]] = hist(nsecs - @start[tid]);
delete(@start[tid]);
delete(@size[tid]);
}'
Caveat: uprobes trap into the kernel on every hit and cost on the order of a microsecond each, far more than kprobes. On an allocation-heavy process this reintroduces the observer effect, so keep capture windows short.
Page Faults
# Page fault handling latency (minor and major faults combined)
sudo bpftrace -e '
kprobe:handle_mm_fault { @start[tid] = nsecs; }
kretprobe:handle_mm_fault /@start[tid]/ {
@page_fault_us = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
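The kprobe above times fault handling but does not separate major faults (disk-backed, often milliseconds) from minor ones (microseconds). Perf software events can split the counts; a sketch, firing on every event (period 1):
# Major vs. minor page faults by process
sudo bpftrace -e '
software:major-faults:1 { @major[comm] = count(); }
software:minor-faults:1 { @minor[comm] = count(); }'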
For memory tuning, see Memory Deep Dive.
Tool 5: Block I/O {#block-io}
Request Latency
# Block I/O latency from request issue to completion (keyed by device + sector)
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @issue[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@issue[args->dev, args->sector]/ {
@io_latency_us = hist((nsecs - @issue[args->dev, args->sector]) / 1000);
delete(@issue[args->dev, args->sector]);
}'
I/O Size Distribution
# What sizes are we reading/writing?
sudo bpftrace -e '
tracepoint:block:block_rq_issue {
@io_size_bytes = hist(args->bytes);
}'
For storage tuning, see Storage Deep Dive.
Production Deployment {#production}
Continuous Monitoring with BCC
#!/usr/bin/env python3
# syscall_latency_exporter.py
from bcc import BPF
from prometheus_client import start_http_server, Histogram
import time

bpf_program = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32);
BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    u32 tid = bpf_get_current_pid_tgid();  // lower 32 bits: thread id
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

TRACEPOINT_PROBE(raw_syscalls, sys_exit) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp) {
        u64 delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta / 1000));  // log2 buckets, microseconds
        start.delete(&tid);
    }
    return 0;
}
"""

SYSCALL_LATENCY = Histogram(
    'syscall_latency_microseconds',
    'Syscall latency',
    buckets=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
)

def main():
    b = BPF(text=bpf_program)
    start_http_server(9101)
    while True:
        time.sleep(10)
        for k, v in b["dist"].items():
            bucket_us = 2 ** k.value  # log2 slot -> upper bound in microseconds
            for _ in range(v.value):  # v.value = number of events in this slot
                SYSCALL_LATENCY.observe(bucket_us)
        b["dist"].clear()

if __name__ == "__main__":
    main()
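A minimal way to run it, assuming the filename above (BCC needs root to load programs; prometheus_client serves /metrics on the port passed to start_http_server):
sudo python3 syscall_latency_exporter.py &
curl -s localhost:9101/metrics | grep syscall_latency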
Kubernetes DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-exporter
spec:
  selector:
    matchLabels:
      app: ebpf-exporter
  template:
    metadata:
      labels:
        app: ebpf-exporter
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: exporter
          image: ebpf-exporter:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
            - name: debug
              mountPath: /sys/kernel/debug
      volumes:
        - name: bpf
          hostPath:
            path: /sys/fs/bpf
        - name: debug
          hostPath:
            path: /sys/kernel/debug
Design Philosophy {#design-philosophy}
Measure, Don’t Assume
Before optimizing:
- Measure current state (where is the latency?)
- Make one change
- Measure again (did it help?)
The pattern: Most optimization efforts target the wrong thing. Measurement reveals the true bottleneck.
The Observer Effect Quantified
| Tool | Overhead/Event | Events/sec Impact |
|---|---|---|
| printf | 10-50µs | Severe |
| strace | 100-500µs | Unusable |
| perf | 0.1-1µs | 5-20% CPU |
| eBPF | 0.1-0.5µs | <1% CPU |
The tradeoff: eBPF requires kernel knowledge to write and interpret. Worth it for production observability.
When to Use Each Tool
| Situation | Tool |
|---|---|
| Quick debugging | bpftrace one-liner |
| Production monitoring | BCC + Prometheus |
| Detailed analysis | perf + flame graphs |
| Syscall list | strace (development only) |
Audit Your Infrastructure
Profiling is only half the battle. Run latency-audit to verify your kernel is configured for low-latency workloads before you start measuring.
pip install latency-audit && latency-audit
Up Next in Linux Infrastructure Deep Dives
Market Data Infrastructure: WebSocket Patterns That Scale
Deep dive into WebSocket reliability, sequence gap detection, Kubernetes patterns, and monitoring for multi-exchange market data systems.
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |