Infrastructure
eBPF Profiling: Nanoseconds Without Adding Any
Deep dive into eBPF, bpftrace, and kernel tracing. How to measure latency at nanosecond precision without the observer effect.
You cannot debug a latency spike if your debugger causes the spike.
Traditional profiling tools add significant overhead: strace costs 100-500µs per syscall, perf sampling uses 5-20% CPU. In a trading system processing 100K messages/second, these tools destroy the behavior you’re trying to measure.
eBPF changes this. Run sandboxed programs in the kernel, measure at nanosecond precision, add nearly zero overhead.
The Problem {#the-problem}
Traditional tools and their costs:
| Tool | Overhead | Why |
|---|---|---|
| strace | 100-500µs/syscall | ptrace stops process, copies data |
| ltrace | 200-800µs/call | Library interposition |
| perf (sampling) | 5-20% CPU | Interrupt-based sampling |
| gdb | Process stopped | Breakpoint traps |
The observer effect: The measurement changes the behavior. At trading speeds, this makes traditional tools useless for production debugging.
eBPF overhead: 100-500 nanoseconds per probe. 1000x less than strace.
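You can sanity-check that figure on your own hardware. A rough sketch, using dd as a syscall generator; the numbers are illustrative, not a rigorous benchmark:
# Baseline: ~2M syscalls (1M reads + 1M writes)
time dd if=/dev/zero of=/dev/null bs=1 count=1000000
# In a second terminal, attach a probe that fires on every syscall entry:
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @calls = count(); }'
# Back in the first terminal, repeat the timing while the probe is attached:
time dd if=/dev/zero of=/dev/null bs=1 count=1000000
# (extra wall time) / (~2M syscalls) ≈ per-probe overhead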
For what to measure, see Monitoring Metrics. For kernel tuning that affects what you’ll observe, see the deep dive series linked in the Reading Path at the end of this post.
Background: What Is eBPF? {#background}
The Architecture
eBPF (extended Berkeley Packet Filter) runs sandboxed programs in the Linux kernel (kernel source: kernel/bpf/).
User space program (bpftrace, bcc tool)
↓
BPF bytecode compilation
↓
Kernel verifier (safety checks)
↓
JIT compilation to native code
↓
Attached to kernel hook (kprobe, tracepoint, etc.)
↓
Runs on every event, collects data
↓
Maps (hash tables, arrays) export to user space
Why It’s Fast
- In-kernel execution: No context switches to user space
- JIT compiled: Runs at near-native speed
- Verified: Can’t crash the kernel
- No locks: Lock-free data structures for metrics
Available Hooks
| Hook Type | What It Traces | Example |
|---|---|---|
| kprobe | Function entry | kprobe:tcp_sendmsg |
| kretprobe | Function return | kretprobe:tcp_sendmsg |
| tracepoint | Static kernel events | tracepoint:sched:sched_switch |
| uprobe | User-space function entry | uprobe:/bin/bash:readline |
| USDT | User-defined traces | Application-specific |
Citation: the eBPF architecture is documented in the kernel source under Documentation/bpf/.
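To see which of these hooks exist on a given box, bpftrace can list probes without attaching anything. The libc path below is the Debian/Ubuntu location and is an assumption:
# Enumerate available probes by pattern
sudo bpftrace -l 'tracepoint:raw_syscalls:*'
sudo bpftrace -l 'kprobe:tcp_*' | head
sudo bpftrace -l 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc'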
Tool 1: bpftrace One-Liners {#bpftrace}
Installation
# Ubuntu/Debian
sudo apt install bpftrace
# Amazon Linux 2023
sudo dnf install bpftrace
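A quick smoke test confirms the kernel can verify, JIT, and attach a program:
sudo bpftrace -e 'BEGIN { printf("bpftrace is working\n"); exit(); }'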
Syscall Latency Histogram
# Distribution of syscall latencies (microseconds)
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
@usecs = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
Output:
@usecs:
[0] 12453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 4521 |@@@@@@@@@@@@@@@@@ |
[2, 4) 1234 |@@@@@ |
[4, 8) 342 |@ |
[8, 16) 45 | |
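The one-liner above is system-wide. To isolate a single process, add a predicate on comm or pid; the process name trading-engine below is a placeholder (comm matches the kernel’s 16-byte task name):
# Same histogram, restricted to one process by name
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter /comm == "trading-engine"/ { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
@usecs = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'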
Per-Syscall Latency
# Which syscalls are slow? Histograms in microseconds, keyed by syscall number
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
@latency[args->id] = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
Tool 2: Network Tracing {#network}
TCP Send Latency
# How long does tcp_sendmsg take in kernel?
sudo bpftrace -e '
kprobe:tcp_sendmsg { @start[tid] = nsecs; }
kretprobe:tcp_sendmsg /@start[tid]/ {
@tcp_send_ns = hist(nsecs - @start[tid]);
delete(@start[tid]);
}'
Packet Receive Path
# Time from network-stack entry (netif_receive_skb) to skb free, approximating the receive path
sudo bpftrace -e '
tracepoint:net:netif_receive_skb { @rx_start[args->skbaddr] = nsecs; }
tracepoint:net:consume_skb /@rx_start[args->skbaddr]/ {
@rx_time_ns = hist(nsecs - @rx_start[args->skbaddr]);
delete(@rx_start[args->skbaddr]);
}'
Softirq Time
# Network RX softirq processing time (vec 3 == NET_RX_SOFTIRQ)
sudo bpftrace -e '
tracepoint:irq:softirq_entry /args->vec == 3/ { @start[tid] = nsecs; }
tracepoint:irq:softirq_exit /args->vec == 3 && @start[tid]/ {
@net_softirq_ns = hist(nsecs - @start[tid]);
delete(@start[tid]);
}'
For network tuning that affects these measurements, see Network Deep Dive.
Tool 3: Scheduler Latency {#scheduler}
Run Queue Latency
# How long do threads wait for CPU?
sudo bpftrace -e '
tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new { @qtime[args->pid] = nsecs; }
tracepoint:sched:sched_switch {
if (args->prev_state == 0) { @qtime[args->prev_pid] = nsecs; } // preempted while runnable: back on the run queue
if (@qtime[args->next_pid]) {
@runq_latency_us = hist((nsecs - @qtime[args->next_pid]) / 1000);
delete(@qtime[args->next_pid]);
}
}'
Interpretation:
- Spikes at 4ms = timer tick preemption (CONFIG_HZ=250; see the check below)
- Long tail = CPU contention
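The 4ms figure follows from CONFIG_HZ=250: a 250 Hz tick is one tick every 4ms. To confirm the tick rate on the box you are measuring (assuming a distro that ships its kernel config in /boot):
# Check the kernel tick rate
grep 'CONFIG_HZ=' /boot/config-$(uname -r)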
Off-CPU Time
# Why is my thread not running?
sudo bpftrace -e '
tracepoint:sched:sched_switch {
@off_start[args->prev_pid] = nsecs;
if (@off_start[args->next_pid]) {
@off_cpu_us[args->next_comm] = hist((nsecs - @off_start[args->next_pid]) / 1000);
delete(@off_start[args->next_pid]);
}
}'
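The histogram tells you how long threads were off-CPU, not why. To see why, aggregate blocking switches by the kernel stack at switch-out; a sketch, where prev_state != 0 filters out preemption and keeps only voluntary blocking:
# Count blocking context switches by kernel stack and process name
sudo bpftrace -e '
tracepoint:sched:sched_switch /args->prev_state != 0/ {
@blocked[kstack, args->prev_comm] = count();
}'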
For scheduler tuning, see CPU Deep Dive.
Tool 4: Memory Allocation {#memory}
Malloc Latency
# How long does malloc take?
sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
@start[tid] = nsecs;
@size[tid] = arg0;
}
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /@start[tid]/ {
@malloc_ns[@size[tid]] = hist(nsecs - @start[tid]);
delete(@start[tid]);
delete(@size[tid]);
}'
Caveat: uprobes trap into the kernel on every hit and cost on the order of a microsecond each, far more than kprobes. On an allocation-heavy process this reintroduces the observer effect, so keep capture windows short.
Page Faults
# Page fault handling latency (minor and major faults combined)
sudo bpftrace -e '
kprobe:handle_mm_fault { @start[tid] = nsecs; }
kretprobe:handle_mm_fault /@start[tid]/ {
@page_fault_us = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
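The kprobe above times fault handling but does not separate major faults (disk-backed, often milliseconds) from minor ones (microseconds). Perf software events can split the counts; a sketch, firing on every event (period 1):
# Major vs. minor page faults by process
sudo bpftrace -e '
software:major-faults:1 { @major[comm] = count(); }
software:minor-faults:1 { @minor[comm] = count(); }'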
For memory tuning, see Memory Deep Dive.
Tool 5: Block I/O {#block-io}
Request Latency
# Block I/O latency from request issue to completion (keyed by device + sector)
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @issue[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@issue[args->dev, args->sector]/ {
@io_latency_us = hist((nsecs - @issue[args->dev, args->sector]) / 1000);
delete(@issue[args->dev, args->sector]);
}'
I/O Size Distribution
# What sizes are we reading/writing?
sudo bpftrace -e '
tracepoint:block:block_rq_issue {
@io_size_bytes = hist(args->bytes);
}'
For storage tuning, see Storage Deep Dive.
Production Deployment {#production}
Continuous Monitoring with BCC
#!/usr/bin/env python3
# syscall_latency_exporter.py
from bcc import BPF
from prometheus_client import start_http_server, Histogram
import time

bpf_program = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32);
BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    u32 tid = bpf_get_current_pid_tgid();  // lower 32 bits: thread id
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

TRACEPOINT_PROBE(raw_syscalls, sys_exit) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp) {
        u64 delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta / 1000));  // log2 buckets, microseconds
        start.delete(&tid);
    }
    return 0;
}
"""

SYSCALL_LATENCY = Histogram(
    'syscall_latency_microseconds',
    'Syscall latency',
    buckets=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
)

def main():
    b = BPF(text=bpf_program)
    start_http_server(9101)
    while True:
        time.sleep(10)
        for k, v in b["dist"].items():
            bucket_us = 2 ** k.value  # log2 slot -> upper bound in microseconds
            for _ in range(v.value):  # v.value = number of events in this slot
                SYSCALL_LATENCY.observe(bucket_us)
        b["dist"].clear()

if __name__ == "__main__":
    main()
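A minimal way to run it, assuming the filename above (BCC needs root to load programs; prometheus_client serves /metrics on the port passed to start_http_server):
sudo python3 syscall_latency_exporter.py &
curl -s localhost:9101/metrics | grep syscall_latency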
Kubernetes DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-exporter
spec:
  selector:
    matchLabels:
      app: ebpf-exporter
  template:
    metadata:
      labels:
        app: ebpf-exporter
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: exporter
          image: ebpf-exporter:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: bpf
              mountPath: /sys/fs/bpf
            - name: debug
              mountPath: /sys/kernel/debug
      volumes:
        - name: bpf
          hostPath:
            path: /sys/fs/bpf
        - name: debug
          hostPath:
            path: /sys/kernel/debug
Design Philosophy {#design-philosophy}
Measure, Don’t Assume
Before optimizing:
- Measure current state (where is the latency?)
- Make one change
- Measure again (did it help?)
The pattern: Most optimization efforts target the wrong thing. Measurement reveals the true bottleneck.
The Observer Effect Quantified
| Tool | Overhead/Event | Events/sec Impact |
|---|---|---|
| printf | 10-50µs | Severe |
| strace | 100-500µs | Unusable |
| perf | 0.1-1µs | 5-20% CPU |
| eBPF | 0.1-0.5µs | <1% CPU |
The tradeoff: eBPF requires kernel knowledge to write and interpret. Worth it for production observability.
When to Use Each Tool
| Situation | Tool |
|---|---|
| Quick debugging | bpftrace one-liner |
| Production monitoring | BCC + Prometheus |
| Detailed analysis | perf + flame graphs |
| Syscall list | strace (development only) |
Audit Your Infrastructure
Profiling is only half the battle. Run latency-audit to verify your kernel is configured for low-latency workloads before you start measuring.
pip install latency-audit && latency-audit
Up Next in Linux Infrastructure Deep Dives
Market Data Infrastructure: WebSocket Patterns That Scale
Deep dive into WebSocket reliability, sequence gap detection, Kubernetes patterns, and monitoring for multi-exchange market data systems.
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |