The Physics of Processes: Life, Death, and Zombies

Why fork() is fast (COW), why Context Switching is slow (TLB Flush), and how the Kernel manages the illusion of multitasking.

Intermediate · 45 min read

🎯 What You'll Learn

  • Deconstruct the `task_struct` (Process Control Block)
  • Analyze Copy-On-Write (COW) memory physics during `fork()`
  • Trace a Context Switch (Register Save + TLB Flush)
  • Debug Zombie Processes (`Z+`) using `pstree`
  • Differentiate `exec()` system calls

📚 Prerequisites

Before this lesson, you should understand:

Introduction

To a user, a Process is a window. To the Kernel, a Process is a data structure (struct task_struct) containing:

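  • files: Pointer to the table of open file descriptors (struct files_struct).
  • mm: Pointer to the memory descriptor (struct mm_struct), which owns the page tables and memory mappings.
  • thread_info: Low-level, architecture-specific thread state, such as the flag that says a reschedule is pending.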

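The Kernel links every task_struct into one circular, doubly linked list. When you run htop, it reads /proc, where the Kernel exposes each of these structures as a directory named after the PID.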


The Physics: Fork and Copy-On-Write (COW)

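In early Unix, fork() was slow because it copied the parent's entire address space. In Linux, fork() is cheap: its cost scales with the size of the Page Tables, not with the amount of memory the parent uses.

The Trick: The Kernel copies the Page Tables, not the RAM. Copying pointers is far cheaper than copying the data they point to. Both Parent and Child map the same physical RAM, and every writable private page is marked “Read-Only” in both.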

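The Event: When either process (say, the Child) tries to write to a variable:

  1. CPU traps (Page Fault).
  2. Kernel pauses the faulting process.
  3. Kernel copies only that specific page (4 KiB on most systems) to a new location.
  4. Kernel maps the new copy “Read-Write” for the Child. The Parent's page stays “Read-Only” until the Parent writes to it; at that point the Kernel sees the page is no longer shared and simply makes it writable.
  5. Child resumes and retries the write.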

Physics: This is why creating 1000 processes (e.g., Nginx workers) consumes little extra RAM beyond per-process Kernel bookkeeping (task_struct, kernel stack, Page Tables) until they start diverging.


Deep Dive: Context Switching Overhead

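Multitasking is an illusion. Each CPU core runs one thing at a time. Switching from Process A to Process B has a direct cost of approximately 1-5 microseconds, plus indirect costs that are paid after the switch. Why?

  1. Register Swap: Save A’s registers to RAM, load B’s. (Fast).
  2. Cache Pollution: L1/L2 caches still hold A’s data, so B’s first memory accesses mostly miss. (Slow).
  3. TLB Flush: When A and B have different address spaces, the CPU discards A’s cached virtual-to-physical address mappings, and B pays a Page Table walk for each page it touches. CPUs that tag TLB entries per address space (PCID on x86, ASID on ARM) can avoid the full flush. (Very Slow).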

Impact: This is why “High Frequency Trading” systems avoid context switches like the plague, typically by binding each critical thread to its own core (CPU Pinning).


Code: The Zombie Maker

A Zombie (Z) is not a process. It is an entry in the process table. It is the Kernel saying: “This child died, and its parent hasn’t asked ‘How?’ yet.”

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();

    if (pid < 0) {
        // fork() failed: no child was created
        perror("fork");
        return 1;
    }

    if (pid > 0) {
        // Parent: Sleeps forever, never calling wait()
        // The child will die, but stay "Zombie" until parent exits
        while (1) sleep(1);
    } else {
        // Child: Exits immediately
        exit(0);
    }
    return 0;
}

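Observation: Run this, then check ps aux | grep Z from another terminal. The child shows state Z (Z+ when it belongs to the foreground process group) and is labeled <defunct>. You cannot kill -9 a Zombie. It is already dead. To clear it, the Parent must call wait(), or you must kill the Parent so that the Zombie is re-parented to Process 1, which reaps it.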


Practice Exercises

Exercise 1: COW Verification (Beginner)

Scenario: Fork a process. In the child, print the address of a variable. Task: Is it the same as in the parent? (Yes, the virtual address is identical). Does it point to the same Physical RAM? (Yes, until either process writes to that page).

Exercise 2: PID Exhaustion (Intermediate)

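Scenario: sysctl kernel.pid_max defaults to 32768 in the Kernel, although many 64-bit distributions raise it (often to 4194304). Task: Write a fork bomb (safely, in a VM!) that exhausts PIDs. What specific error does the shell return? Check ulimit -u first, because the per-user process limit may trigger before PIDs run out.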

Exercise 3: Context Switch Profiling (Advanced)

Task: Use perf stat -e context-switches ./myprogram to measure how many times your program was kicked off the CPU.


Knowledge Check

  1. Why is fork() faster than copying the same amount of memory?
  2. What actually happens when you Copy-On-Write?
  3. Why can’t you kill a Zombie process?
  4. How does CPU Pinning improve performance?
  5. What does Process 1 (init/systemd) do with orphans?

Answers

  1. It doesn’t copy or allocate RAM for the data. It only copies metadata (Page Tables).
  2. Page Fault. The write traps into the Kernel, which lazily duplicates the single page involved (typically 4 KiB) and lets the write proceed on the private copy.
  3. It’s already dead. It has no code, no memory, no CPU time. It is a line in a ledger.
  4. It keeps the Cache and TLB “warm”. A pinned thread is never migrated to another core, so the data and address translations it needs stay cached.
  5. Adoption. It calls wait() on them, cleaning up their exit status and removing the zombie entry.

Summary

  • Fork: Lazy copying (COW).
  • Context Switch: The cost of multitasking (Cache and TLB Penalties).
  • Zombies: Administrative leftovers, not memory leaks.
