The Physics of Networking: From NIC to Socket
Why your ping is 0.1ms but your app takes 10ms. The physics of DMA, Ring Buffers, SoftIRQs (NET_RX), and the Socket Buffer (sk_buff).
🎯 What You'll Learn
- Trace a packet's physical journey (NIC -> RAM -> CPU)
- Deconstruct the `sk_buff` (Socket Buffer) structure
- Analyze the SoftIRQ Bottleneck (`ksoftirqd`)
- Explain NAPI (New API) Polling vs Interrupts
- Tune Sysctl for 10Gbps+ Throughput
📚 Prerequisites
Before this lesson, you should understand:
Introduction
Most developers think networking happens in the “Cloud.” Systems Engineers know networking happens in a Ring Buffer.
When you send a request to a database, you aren’t just sending “data.” You are triggering a violent chain reaction of electrical interrupts, memory copies, and context switches.
This lesson traces the Physics of a Packet: The exact sequence of events from the moment a photon hits your Network Card to the moment your Node.js app fires a callback.
The Physics: The Path of Ingress (RX)
A packet arrives at the NIC (Network Interface Card). What happens next?
Phase 1: The Hardware (DMA)
The CPU is too slow to read packets one by one. Instead, the NIC uses Direct Memory Access (DMA) to write the packet directly into a pre-allocated space in RAM called the RX Ring Buffer.
- Physics: The packet is in RAM, but the CPU doesn’t know it yet.
- Action: The NIC fires a Hard Interrupt (IRQ) to wake up the CPU.
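To make the "parking lot" concrete, here is a minimal C sketch of the idea behind an RX descriptor ring. The field names and sizes are illustrative assumptions, not any real NIC's register layout:

```c
#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 256      /* real NICs often use 512-4096 descriptors */
#define BUF_SIZE  2048     /* each slot holds one full Ethernet frame  */

/* One descriptor: where the NIC may DMA a packet, plus its status.
 * Real hardware layouts differ per NIC; these fields are illustrative. */
struct rx_desc {
    uint64_t buf_addr;     /* physical address of a BUF_SIZE buffer    */
    uint16_t length;       /* bytes written, filled in by the NIC      */
    uint8_t  done;         /* NIC flips this once the packet is in RAM */
};

/* The ring: a fixed circular array shared with the hardware. The NIC
 * produces (DMA-writes packets); the kernel consumes (polls). */
struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    size_t next_to_clean;  /* next descriptor the CPU should inspect   */
};
```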
Phase 2: The SoftIRQ (NET_RX)
The Hard IRQ handler must be insanely fast. It cannot process TCP logic. It signals: "Hey kernel, there's work to do," raises a SoftIRQ (NET_RX_SOFTIRQ), and exits.
This keeps the CPU responsive. The heavy lifting happens later in ksoftirqd context.
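The same top-half/bottom-half split exists in ordinary user-space programs: a signal handler must do the bare minimum, and the main loop does the heavy work later. A rough C analogy, with SIGALRM standing in for the hardware interrupt:

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* "Top half": runs in signal (interrupt) context, so it must do the
 * bare minimum: record that work exists, then return immediately. */
static volatile sig_atomic_t work_pending = 0;

static void top_half(int sig)
{
    (void)sig;
    work_pending = 1;           /* analogous to raising NET_RX_SOFTIRQ */
}

int main(void)
{
    signal(SIGALRM, top_half);
    alarm(1);                   /* a fake "hardware interrupt" in 1s   */

    /* "Bottom half": the deferred heavy lifting, done outside the
     * handler (toy example; production code would close the race
     * between the check and pause() with sigsuspend()). */
    while (!work_pending)
        pause();

    puts("processing packet backlog (softirq context)");
    return 0;
}
```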
Phase 3: NAPI Polling
In the old days, 10,000 packets meant 10,000 interrupts. This caused "receive livelock": the CPU spent 100% of its time handling interrupts and 0% processing data. The solution: NAPI (New API); a sketch of the poll cycle follows the list below.
- First packet -> Interrupt.
- Kernel disables Interrupts for that NIC.
- Kernel polls the Ring Buffer until it is empty (or the poll budget is spent).
- Re-enables Interrupts.
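Here is a simplified, user-space C sketch of that poll cycle. Real drivers implement this as a callback on struct napi_struct and finish with napi_complete_done(); the ring_packets counter and the budget of 4 below are made-up stand-ins:

```c
#include <stdio.h>

static int ring_packets = 10;   /* pretend 10 frames sit in the RX ring */
static int irq_enabled  = 0;

static int my_poll(int budget)
{
    int done = 0;

    /* Interrupts are OFF here: the hard IRQ scheduled this poll. */
    while (done < budget && ring_packets > 0) {
        ring_packets--;         /* "process" one packet                 */
        done++;
    }

    if (done < budget)
        irq_enabled = 1;        /* ring drained: re-arm the interrupt   */
    /* done == budget: stay in poll mode; the kernel calls us again.    */

    return done;
}

int main(void)
{
    while (!irq_enabled)
        printf("poll cycle processed %d packets\n", my_poll(4));
    return 0;
}
```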
Deep Dive: The sk_buff (Socket Buffer)
The struct sk_buff is the most important data structure in Linux networking. It represents a packet.
Crucially, Linux never copies packet data if it can avoid it.
It just passes pointers to this sk_buff structure around.
The Anatomy
- `head`: start of the allocated buffer.
- `data`: start of the actual packet data (e.g., skips past the Ethernet header).
- `tail`: end of the packet data.
- `end`: end of the allocated buffer.
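In C terms, the anatomy boils down to four pointers. A simplified sketch (the real struct sk_buff in <linux/skbuff.h> has dozens more fields):

```c
/* The four pointers that define an sk_buff's data area. */
struct sk_buff_sketch {
    unsigned char *head;   /* start of the allocated buffer            */
    unsigned char *data;   /* start of the current protocol layer      */
    unsigned char *tail;   /* end of the valid packet data             */
    unsigned char *end;    /* end of the allocated buffer              */
};

/* Invariant: head <= data <= tail <= end.
 * headroom = data - head  (room to prepend lower-layer headers)
 * tailroom = end - tail   (room to append data)                       */
```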
Physics of Parsing:
When the kernel parses headers (Ethernet -> IP -> TCP), it doesn’t move memory. It just increments the data pointer.
`skb_pull(skb, 14)` -> effectively "strips" the Ethernet header by moving the `data` pointer forward 14 bytes. Zero cost.
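A tiny runnable demonstration of the same trick, with a pull() helper that mimics the spirit of skb_pull() on a plain buffer (the 64-byte frame is a stand-in for a DMA'd packet):

```c
#include <stdio.h>

#define ETH_HLEN 14   /* Ethernet header: dst(6) + src(6) + ethertype(2) */

/* "Remove" a header by advancing the data pointer.
 * No bytes are moved or copied. */
static unsigned char *pull(unsigned char **data, unsigned int len)
{
    *data += len;
    return *data;
}

int main(void)
{
    unsigned char frame[64] = {0};   /* stand-in for a DMA'd frame */
    unsigned char *data = frame;

    printf("before: data = %p\n", (void *)data);
    pull(&data, ETH_HLEN);           /* "strip" the Ethernet header */
    printf("after:  data = %p (advanced %d bytes, zero copies)\n",
           (void *)data, ETH_HLEN);
    return 0;
}
```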
Strategy: Zero Copy Networking
Why is read() slow?
Because it forces a Context Switch AND a Memory Copy.
Packet (Kernel RAM) -> App Buffer (User RAM).
The Solution: sendfile() (or mmap()).
sendfile() lets the kernel push data straight from Disk -> Network without ever copying it into User Space; mmap() sidesteps the read() copy by mapping the file into your address space instead.
- Result: The CPU does nothing. DMA handles Disk -> RAM, and RAM -> NIC.
- Throughput: 10Gbps+ on a single core.
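A minimal C sketch of the sendfile() path, assuming peer_fd is an already-connected TCP socket (error handling trimmed for brevity):

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Ship a whole file to a connected socket without the data
 * ever entering user space. */
static int send_whole_file(int peer_fd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    fstat(fd, &st);

    off_t off = 0;
    while (off < st.st_size) {
        /* The kernel feeds file pages straight to the socket: no
         * read()/write() round trip through a user buffer. */
        ssize_t n = sendfile(peer_fd, fd, &off, st.st_size - off);
        if (n <= 0)
            break;
    }

    close(fd);
    return off == st.st_size ? 0 : -1;
}
```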
Code: Tuning for High Throughput
To handle 10Gbps or high packet rates, default Linux settings are insufficient.
```bash
# 1. Enlarge the Ring Buffers (NIC hardware queue)
# Prevents packet drops at the hardware level during micro-bursts
ethtool -G eth0 rx 4096 tx 4096

# 2. Distribute flow processing across cores (Receive Flow Steering)
# Prevents a single core (usually CPU0) from handling all traffic.
# RPS itself is enabled per RX queue, e.g.:
#   echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
sysctl -w net.core.rps_sock_flow_entries=32768

# 3. Increase the SoftIRQ budget
# Lets the kernel process more packets per poll before yielding the CPU
sysctl -w net.core.netdev_budget=600

# 4. Enlarge TCP buffer limits (min default max) to cover the
#    Bandwidth-Delay Product (BDP)
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
```
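These sysctls only raise the ceiling; an application can also request bigger buffers per socket. A small sketch using setsockopt() (the 4 MB value is an arbitrary example):

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* Request a 4 MB receive buffer. The kernel doubles the value for
     * bookkeeping overhead and caps it at net.core.rmem_max. */
    int size = 4 * 1024 * 1024;
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));

    /* Read back what was actually granted. */
    socklen_t len = sizeof(size);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len);
    printf("effective SO_RCVBUF: %d bytes\n", size);

    close(fd);
    return 0;
}
```

Caveat: explicitly setting SO_RCVBUF disables the kernel's per-socket receive autotuning, so for most services it is better to raise the tcp_rmem limits and let autotuning do its job.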
Practice Exercises
Exercise 1: The Drop (Beginner)
Task: Run ethtool -S eth0 | grep drop.
Observation: If rx_queue_0_drops is increasing, your Ring Buffer is too small. The kernel isn’t polling fast enough to empty the hardware queue.
Exercise 2: The SoftIRQ Storm (Intermediate)
Task: Run top and look at the %si (SoftIRQ) CPU usage field.
Scenario: If %si is 100% on one core (e.g., Cpu0), you are bottlenecked by interrupt handling.
Fix: Enable RPS (Receive Packet Steering) to spread the load to other cores.
Exercise 3: Zero Copy Benchmark (Advanced)
Task: Serve a large file two ways: once through a naive read()/write() loop, once via sendfile() (e.g., nginx with sendfile on;), and compare CPU usage in top.
Result: sendfile consumes drastically less CPU because the data never crosses the User/Kernel boundary.
Knowledge Check
- What performs the copy from NIC to RAM?
- Why did "receive livelock" happen before NAPI?
- What does `skb_pull()` actually do to the memory?
- Why is `ksoftirqd` usage high during heavy network load?
- What is the "Bandwidth-Delay Product"?
Answers
- DMA (Direct Memory Access). The CPU is not involved.
- CPU Starvation. The CPU spent all its cycles entering/exiting interrupt handlers, doing no actual work.
- Nothing. It just increments a pointer (advances the start offset).
- SoftIRQ Processing. This is the kernel thread dedicated to processing the backlog of packets from the Ring Buffer.
- Buffer Size. Throughput * RTT: the amount of data "in flight" that must be buffered for max speed, e.g., 10Gbps * 20ms RTT = 25MB in flight.
Summary
- RX Ring: The hardware parking lot.
- SoftIRQ: The kernel worker thread.
- sk_buff: The pointer-based packet structure.
- Zero Copy: The art of doing nothing.