The Physics of Networking: From NIC to Socket

Why your ping is 0.1ms but your app sees 10ms. The physics of DMA, Ring Buffers, SoftIRQs (NET_RX), and the Socket Buffer (sk_buff).


🎯 What You'll Learn

  • Trace a packet's physical journey (NIC -> RAM -> CPU)
  • Deconstruct the `sk_buff` (Socket Buffer) structure
  • Analyze the SoftIRQ Bottleneck (`ksoftirqd`)
  • Explain NAPI (New API) Polling vs Interrupts
  • Tune Sysctl for 10Gbps+ Throughput

📚 Prerequisites

Before this lesson, you should understand:

Introduction

Most developers think networking happens in the “Cloud.” Systems Engineers know networking happens in a Ring Buffer.

When you send a request to a database, you aren’t just sending “data.” You are triggering a violent chain reaction of electrical interrupts, memory copies, and context switches.

This lesson traces the Physics of a Packet: The exact sequence of events from the moment a photon hits your Network Card to the moment your Node.js app fires a callback.


The Physics: The Path of Ingress (RX)

A packet arrives at the NIC (Network Interface Card). What happens next?

Phase 1: The Hardware (DMA)

The CPU is too slow to read packets one by one. Instead, the NIC uses Direct Memory Access (DMA) to write the packet directly into a pre-allocated space in RAM called the RX Ring Buffer.

  • Physics: The packet is in RAM, but the CPU doesn’t know it yet.
  • Action: The NIC fires a Hard Interrupt (IRQ) to wake up the CPU.
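
To make the ring concrete, here is a tiny self-contained C toy model (this is not driver code; the struct, field names, and sizes are invented purely for illustration). The "NIC" advances a producer index as it DMAs packets in; the "kernel" advances a consumer index as it drains them.

/* Toy model of an RX descriptor ring -- illustrative only, not a real driver. */
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 8                /* real NICs use 256-4096 slots */

struct rx_desc {
    uint16_t len;                  /* filled in by "hardware" when the packet lands */
    uint8_t  ready;                /* "descriptor done" flag                        */
};

int main(void)
{
    static struct rx_desc ring[RING_SIZE];
    unsigned producer = 0;         /* advanced by the NIC (simulated below) */
    unsigned consumer = 0;         /* advanced by the kernel's poll loop    */

    /* The "NIC" DMAs three packets into consecutive slots. */
    for (int i = 0; i < 3; i++) {
        ring[producer % RING_SIZE].len   = 1500;
        ring[producer % RING_SIZE].ready = 1;
        producer++;
    }

    /* The "kernel" drains the ring -- conceptually what NAPI polling does. */
    while (ring[consumer % RING_SIZE].ready) {
        printf("consumed slot %u, %u bytes\n",
               consumer % RING_SIZE, (unsigned)ring[consumer % RING_SIZE].len);
        ring[consumer % RING_SIZE].ready = 0;
        consumer++;
    }
    return 0;
}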

Phase 2: The SoftIRQ (NET_RX)

The Hard IRQ handler must be insanely fast. It cannot process TCP logic. It signals: “Hey kernel, there’s work to do,” triggers a SoftIRQ, and exits.

This keeps the CPU responsive. The heavy lifting happens later in SoftIRQ context, and when packets arrive faster than they can be processed, it spills over into the ksoftirqd kernel threads.
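
To see how little the hard-IRQ half actually does, here is a hedged sketch of the pattern NIC drivers follow. It will not compile on its own: struct my_nic_priv and the my_nic_* helpers are hypothetical placeholders, while irqreturn_t, napi_schedule(), and IRQ_HANDLED are the real kernel primitives.

/* Sketch of a NIC driver's hard-IRQ handler (hypothetical my_nic_* names). */
static irqreturn_t my_nic_irq(int irq, void *dev_id)
{
    struct my_nic_priv *priv = dev_id;   /* hypothetical per-device state */

    my_nic_mask_irqs(priv);              /* hypothetical: stop the NIC from interrupting again */
    napi_schedule(&priv->napi);          /* real API: raises NET_RX_SOFTIRQ for later processing */

    return IRQ_HANDLED;                  /* exit in microseconds; no TCP work happens here */
}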

Phase 3: NAPI Polling

In the old days, 10,000 packets meant 10,000 Interrupts. This caused “Receive Livelock”: the CPU spent 100% of its time handling interrupts and 0% processing data. Solution: NAPI (New API).

  • First packet -> Interrupt.
  • Kernel disables Interrupts for that NIC.
  • Kernel Polls the Ring Buffer until it is empty.
  • Re-enables Interrupts.
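
The other half of that contract is the driver's NAPI poll callback, sketched below with the same hypothetical my_nic_* names. The real pieces are the budget argument, container_of(), and napi_complete_done(), which is what ends polling mode and re-arms interrupts once the ring is empty.

/* Sketch of a NAPI poll callback (hypothetical my_nic_* helpers). */
static int my_nic_poll(struct napi_struct *napi, int budget)
{
    struct my_nic_priv *priv = container_of(napi, struct my_nic_priv, napi);
    int work_done = 0;

    /* Drain the RX ring, but never more than `budget` packets per call. */
    while (work_done < budget && my_nic_ring_has_packets(priv)) {
        my_nic_process_one_packet(priv);   /* builds an sk_buff and hands it up the stack */
        work_done++;
    }

    if (work_done < budget) {
        /* Ring is empty: leave polling mode and turn interrupts back on. */
        napi_complete_done(napi, work_done);
        my_nic_unmask_irqs(priv);          /* hypothetical helper */
    }
    return work_done;
}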

Deep Dive: The sk_buff (Socket Buffer)

The struct sk_buff is the most important data structure in the Linux networking stack. It represents a single packet as it travels through the kernel.

Crucially, Linux never copies packet data if it can avoid it. It just passes pointers to this sk_buff structure around.

The Anatomy

  • head: Start of the allocated buffer.
  • data: Start of the actual packet data (e.g., skips the Ethernet Header).
  • tail: End of the packet data.
  • end: End of the allocated buffer.

Physics of Parsing: When the kernel parses headers (Ethernet -> IP -> TCP), it doesn’t move memory. It just increments the data pointer. skb_pull(skb, 14) -> effectively “strips” the Ethernet header by moving the pointer forward 14 bytes. Zero cost.
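
Here is a self-contained toy you can compile with any C compiler to convince yourself that this is pure pointer arithmetic. struct toy_skb and toy_skb_pull() are simplified stand-ins for the kernel's sk_buff and skb_pull(), not the real definitions.

/* Toy model of the four sk_buff pointers -- not the kernel struct. */
#include <stdio.h>

struct toy_skb {
    unsigned char *head;   /* start of the allocated buffer    */
    unsigned char *data;   /* start of the current packet view */
    unsigned char *tail;   /* end of the packet data           */
    unsigned char *end;    /* end of the allocated buffer      */
    unsigned int   len;    /* tail - data                      */
};

/* Mirrors the effect of skb_pull(skb, n): advance data, shrink len. */
static unsigned char *toy_skb_pull(struct toy_skb *skb, unsigned int n)
{
    skb->data += n;
    skb->len  -= n;
    return skb->data;
}

int main(void)
{
    unsigned char buf[2048];
    struct toy_skb skb = {
        .head = buf, .data = buf, .tail = buf + 1514, .end = buf + sizeof(buf),
        .len  = 1514,
    };

    printf("before: data=%p len=%u\n", (void *)skb.data, skb.len);
    toy_skb_pull(&skb, 14);    /* "strip" the 14-byte Ethernet header */
    printf("after : data=%p len=%u\n", (void *)skb.data, skb.len);
    /* No memmove or memcpy happened -- only a pointer and a length changed. */
    return 0;
}

Running it shows data advancing by exactly 14 bytes and len shrinking by 14, with no copy anywhere.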


Strategy: Zero Copy Networking

Why is read() slow? Because it forces a Context Switch AND a Memory Copy. Packet (Kernel RAM) -> App Buffer (User RAM).

The Solution: sendfile() or mmap(). These syscalls allow the Kernel to send data from Disk -> Network without ever copying it to User Space.

  • Result: The CPU barely touches the payload. DMA handles Disk -> RAM and RAM -> NIC; the CPU only sets up descriptors and headers.
  • Throughput: 10Gbps+ on a single core.
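
Below is a minimal user-space sketch of the sendfile() path. The file name, peer address, and port are placeholders and error handling is trimmed; the point is that the payload travels page cache -> socket entirely inside the kernel.

/* Minimal sendfile(2) sketch: file path, address, and port are placeholders. */
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int file_fd = open("largefile", O_RDONLY);       /* placeholder path */
    if (file_fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(file_fd, &st);

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in peer = {
        .sin_family = AF_INET,
        .sin_port   = htons(9000),                   /* placeholder port */
    };
    inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);
    if (connect(sock, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect"); return 1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel moves data page cache -> socket; it never enters user space. */
        ssize_t sent = sendfile(sock, file_fd, &offset, st.st_size - offset);
        if (sent <= 0) { perror("sendfile"); break; }
    }
    close(sock);
    close(file_fd);
    return 0;
}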

Code: Tuning for High Throughput

Default Linux settings are insufficient for 10Gbps+ links or high packet rates.

# 1. Enlarge the Ring Buffers (NIC Hardware Queue)
# Prevents packet drops at the hardware level during micro-bursts
# (check the hardware maximum first with: ethtool -g eth0)
ethtool -G eth0 rx 4096 tx 4096

# 2. Distribute packet processing across cores (RPS/RFS)
# Ensures not just CPU Core 0 is handling all network traffic
# This sysctl sizes the global Receive Flow Steering (RFS) table; pair it
# with per-queue CPU masks in /sys/class/net/eth0/queues/rx-*/rps_cpus
sysctl -w net.core.rps_sock_flow_entries=32768

# 3. Increase SoftIRQ budget
# Allow the kernel to process more packets before yielding CPU
sysctl -w net.core.netdev_budget=600

# 4. Enlarge TCP Window limits (BDP - Bandwidth Delay Product)
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

Practice Exercises

Exercise 1: The Drop (Beginner)

Task: Run ethtool -S eth0 | grep drop (counter names vary by driver). Observation: If rx_queue_0_drops keeps increasing, the hardware queue is overflowing: either the Ring Buffer is too small for your bursts, or the kernel isn't polling fast enough to empty it.

Exercise 2: The SoftIRQ Storm (Intermediate)

Task: Run top and look at the %si (SoftIRQ) CPU usage field. Scenario: If %si is 100% on one core (e.g., Cpu0), you are bottlenecked by interrupt handling. Fix: Enable RPS (Receive Packet Steering) to spread the load to other cores.

Exercise 3: Zero Copy Benchmark (Advanced)

Task: Serve a large file over HTTP twice, once through the userspace copy path and once via sendfile() (e.g., nginx with sendfile off vs sendfile on). Result: The sendfile run consumes drastically less CPU because the data never crosses the User/Kernel boundary.


Knowledge Check

  1. What performs the copy from NIC to RAM?
  2. Why did “Receive Livelock” happen before NAPI?
  3. What does skb_pull() actually do to the memory?
  4. Why is ksoftirqd usage high during heavy network load?
  5. What is the “Bandwidth-Delay Product”?
Answers
  1. DMA (Direct Memory Access). The CPU is not involved.
  2. CPU Starvation. The CPU spent all its cycles entering/exiting interrupt handlers, doing no actual work.
  3. Nothing. It just increments a pointer (advances the start offset).
  4. SoftIRQ Processing. This is the kernel thread dedicated to processing the backlog of packets from the Ring Buffer.
  5. Buffer Size. Throughput * RTT. The amount of data “in flight” that needs to be buffered for max speed.

Summary

  • RX Ring: The hardware parking lot.
  • SoftIRQ: The kernel worker thread.
  • sk_buff: The pointer-based packet structure.
  • Zero Copy: The art of doing nothing.
