The Physics of Namespaces: nsproxy, clone() & Isolation

How Docker tricks processes into thinking they are alone. The physics of `nsproxy`, the `clone()` bitmask, and Virtual Ethernet pairs.

Intermediate 40 min read

🎯 What You'll Learn

  • Deconstruct the `task_struct -> nsproxy` pointer chain
  • Decode `CLONE_NEWPID` and `CLONE_NEWNET` bitmasks
  • Trace a packet through a Virtual Ethernet Pair (`veth`)
  • Manipulate namespaces directly with `unshare` and `setns`
  • Explain User Namespace UID Remapping (Physics of Rootless)

Introduction

A Container is not a real thing. It is a lie. It is a Linux process that has been tricked into seeing a subset of the world.

The mechanism for this deception is the Namespace. While Cgroups limit how much you can use, Namespaces limit what you can see.

This lesson explores the physics of this deception: How the kernel takes a pointer in task_struct and swaps it out to create a parallel universe.


The Physics: nsproxy and clone()

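Inside the kernel, every process is described by a task_struct (declared in include/linux/sched.h), and every task_struct carries a pointer: struct nsproxy *nsproxy;

This struct (declared in include/linux/nsproxy.h) holds pointers to most of the namespaces the task belongs to. The user namespace is the exception: it is reached through the task's credentials rather than through nsproxy.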

struct nsproxy {
    struct uts_namespace *uts_ns;              // Hostname
    struct ipc_namespace *ipc_ns;              // Shared Memory
    struct mnt_namespace *mnt_ns;              // Filesystem Mounts
    struct pid_namespace *pid_ns_for_children; // Process IDs (applies to children)
    struct net *net_ns;                        // Network Stack
    // ...
};

When you fork a standard process, the child shares these pointers with the parent. When you create a container, you use the clone() syscall with specific flags: clone(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS, ...)
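The flag argument is an ordinary bitmask, so it can be decoded by hand. The sketch below is a minimal illustration in Python; the numeric values are the ones defined in <linux/sched.h>, while the helper name decode_clone_flags is made up for this example.

```python
# Namespace-related flag bits, as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "CLONE_NEWTIME":   0x00000080,  # accepted by unshare()/clone3(), not classic clone()
    "CLONE_NEWNS":     0x00020000,  # mount namespace
    "CLONE_NEWCGROUP": 0x02000000,
    "CLONE_NEWUTS":    0x04000000,
    "CLONE_NEWIPC":    0x08000000,
    "CLONE_NEWUSER":   0x10000000,
    "CLONE_NEWPID":    0x20000000,
    "CLONE_NEWNET":    0x40000000,
}


def decode_clone_flags(mask: int) -> list[str]:
    """Return the names of the namespace flags set in mask, lowest bit first."""
    ordered = sorted(CLONE_FLAGS.items(), key=lambda item: item[1])
    return [name for name, bit in ordered if mask & bit]


# The same combination used in the clone() call above.
mask = (CLONE_FLAGS["CLONE_NEWPID"]
        | CLONE_FLAGS["CLONE_NEWNET"]
        | CLONE_FLAGS["CLONE_NEWNS"])
print(hex(mask), decode_clone_flags(mask))
# 0x60020000 ['CLONE_NEWNS', 'CLONE_NEWPID', 'CLONE_NEWNET']
```

Each flag occupies its own bit, which is why flags can be OR-ed together freely and tested independently with a bitwise AND.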

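The Physics: On a plain fork, the child simply takes another reference to the parent's nsproxy. When any CLONE_NEW* flag is set, the kernel allocates a new nsproxy for the child, creates a fresh namespace for each flag that was passed, and copies the parent's pointers for the rest. With the flags above, the child sees a network stack containing only a loopback interface (initially down) and a process list in which it is PID 1. Its mount table starts as a private copy of the parent's, and because CLONE_NEWUTS was not passed, it still shares the parent's hostname. It is alone.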
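The pointer swap can be modelled in a few lines. This is a toy model, not kernel code: the Namespace and NsProxy classes and the integer identities are invented for illustration, and the helper only borrows the name of the kernel's copy_namespaces() function. It assumes the behaviour that a namespace is replaced only when its flag was requested and is shared otherwise.

```python
from dataclasses import dataclass, replace
from itertools import count

_ids = count(1)  # stand-in for the identity of a kernel object


@dataclass(frozen=True)
class Namespace:
    kind: str
    ident: int


@dataclass(frozen=True)
class NsProxy:
    uts: Namespace
    ipc: Namespace
    mnt: Namespace
    pid: Namespace
    net: Namespace


def new_namespace(kind: str) -> Namespace:
    return Namespace(kind, next(_ids))


def copy_namespaces(parent: NsProxy, new_kinds: frozenset = frozenset()) -> NsProxy:
    """Share the parent's nsproxy, or build a new one with some pointers swapped."""
    if not new_kinds:
        return parent  # plain fork: the very same nsproxy object
    swapped = {kind: new_namespace(kind) for kind in new_kinds}
    return replace(parent, **swapped)  # untouched fields keep the parent's objects


host = NsProxy(*(new_namespace(k) for k in ("uts", "ipc", "mnt", "pid", "net")))
forked = copy_namespaces(host)
container = copy_namespaces(host, frozenset({"pid", "net", "mnt"}))

print(forked is host)             # True: fork shares the whole nsproxy
print(container.uts is host.uts)  # True: no CLONE_NEWUTS, same hostname namespace
print(container.net is host.net)  # False: new network namespace
```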


Network Namespaces (NetNS)

The Network Namespace is the most complex because it isolates the entire TCP/IP stack: Interfaces, Routing Tables, Firewall Rules (iptables), and Sockets.

The Veth Pair Physics

How do we connect this isolated island back to the world? We use a Virtual Ethernet Pair (veth). Consider it a virtual patch cable.

  • End A (veth0): Plugged into the Host Namespace.
  • End B (veth1): Plugged into the Container Namespace.

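When the container sends a packet, it enters End B and immediately appears on End A in the host namespace. From there the host forwards it: in Docker's default setup, End A is attached to a bridge (docker0), and the host routes the packet out through its physical eth0 interface, rewriting the source address with NAT (masquerading).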
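The "patch cable" behaviour is simple enough to mimic in user space. The sketch below is only a mental model with invented names (VethEnd, veth_pair); a real veth device hands frames to the peer's receive path inside the kernel, and none of that machinery is reproduced here.

```python
from collections import deque


class VethEnd:
    """One end of a toy veth pair: anything sent here is received by the peer."""

    def __init__(self, name: str):
        self.name = name
        self.peer = None
        self.rx = deque()  # frames waiting to be read on this end

    def send(self, frame: bytes) -> None:
        if self.peer is None:
            raise RuntimeError(f"{self.name} has no peer")
        self.peer.rx.append(frame)  # transmit on one end == receive on the other


def veth_pair(name_a: str, name_b: str):
    a, b = VethEnd(name_a), VethEnd(name_b)
    a.peer, b.peer = b, a
    return a, b


host_end, container_end = veth_pair("veth0", "veth1")
container_end.send(b"ping")
print(host_end.rx.popleft())  # b'ping'
```

Which namespace each end lives in is irrelevant to the pair itself, which is why moving one end into a container is enough to give it a link to the host.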


User Namespaces (UserNS)

This is the physics behind “Rootless Containers”. It works by Offset Mapping.

  • Host UID: 1000 (nikhil).
  • Container UID: 0 (root).

The User Namespace creates a mapping table: 0 inside -> 1000 outside.

When the container process calls setuid(0), the kernel checks the map: “Okay, to the rest of the system you are UID 1000, but inside this namespace I will say you are UID 0”. Permission checks on host files use the mapped value, so if this process tries to touch a file owned by real root (UID 0), the kernel treats it as UID 1000 and denies access. Host UID 0 has no entry in the map, so inside the container such a file shows up as owned by the overflow UID 65534 (nobody). Safe.
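The mapping table has the same shape as /proc/[pid]/uid_map, where each line reads "inside-start outside-start length". The following sketch reimplements the lookup in Python as an illustration; the function names are invented here, and 65534 is assumed to be the overflow UID because that is the kernel default.

```python
OVERFLOW_UID = 65534  # default value of /proc/sys/kernel/overflowuid ("nobody")


def parse_uid_map(text: str) -> list[tuple[int, int, int]]:
    """Parse uid_map lines of the form '<inside-start> <outside-start> <length>'."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_outside(uid_map, inside_uid: int):
    """Translate a UID seen inside the namespace to the host UID, or None."""
    for inside, outside, length in uid_map:
        if inside <= inside_uid < inside + length:
            return outside + (inside_uid - inside)
    return None  # unmapped: this ID cannot be used inside the namespace


def to_inside(uid_map, outside_uid: int) -> int:
    """Translate a host UID to what the namespace sees (overflow UID if unmapped)."""
    for inside, outside, length in uid_map:
        if outside <= outside_uid < outside + length:
            return inside + (outside_uid - outside)
    return OVERFLOW_UID


rootless = parse_uid_map("0 1000 1\n")  # container root -> host UID 1000
print(to_outside(rootless, 0))     # 1000
print(to_inside(rootless, 1000))   # 0
print(to_inside(rootless, 0))      # 65534: host root is "nobody" in here
```

A range longer than 1 works the same way: with the line "0 100000 65536", UID 33 inside becomes 100033 outside.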


Code: Entering the Matrix (nsenter)

You can physically jump your shell into another process’s namespace.

# 1. Start a dummy container
docker run -d --name matrix alpine sleep 1000

# 2. Find its Physical PID
PID=$(docker inspect --format '{{.State.Pid}}' matrix)
echo "Container PID: $PID"

# 3. Enter its Network Namespace
# nsenter opens the /proc/$PID/ns/net magic file and calls setns() on it
# The command you run now sees the container's interfaces!
sudo nsenter --target "$PID" --net ip addr show

Physics observation: Look at /proc/$PID/ns/. These are “Magic Links”: each entry is a symbolic link whose target looks like net:[4026531992]. The number is an Inode Number that identifies the namespace object in kernel memory. If two processes have the same Inode for net, they are in the same network namespace.
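The inode comparison is easy to script. This sketch assumes a Linux system with /proc mounted; the helper names are invented for the example, and reading the links of a process owned by another user may require root.

```python
import os
import re


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a link target such as 'net:[4026531992]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespace_id(pid, ns_type: str = "net") -> tuple[str, int]:
    """Read /proc/<pid>/ns/<ns_type> and return its (type, inode) pair."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/{ns_type}"))


def same_namespace(pid_a, pid_b, ns_type: str = "net") -> bool:
    """True if both processes point at the same namespace object."""
    return namespace_id(pid_a, ns_type) == namespace_id(pid_b, ns_type)


if os.path.exists("/proc/self/ns/net"):
    # The inode printed here differs from machine to machine.
    print(namespace_id("self", "net"))
    print(same_namespace("self", os.getpid()))  # a process always matches itself
```

Calling same_namespace(1, PID) with the container PID from step 2 would be expected to return False for net, since the container has its own network namespace.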


Practice Exercises

Exercise 1: The Lie (Beginner)

Task: sudo unshare --fork --pid --mount-proc /bin/bash.
Action: Run ps aux.
Observation: You see only two processes: bash (PID 1) and ps. The hundreds of host processes are invisible. You are in a new PID Universe.

Exercise 2: The Wire (Intermediate)

Task: Create two NetNS. Link them with veth. Action:

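# Run as root (or prefix each command with sudo)
ip netns add blue
ip netns add red
ip link add veth-b type veth peer name veth-r
ip link set veth-b netns blue
ip link set veth-r netns red
# Assign IPs and ping! (10.0.0.0/24 is only an example subnet)
ip netns exec blue ip addr add 10.0.0.1/24 dev veth-b
ip netns exec red ip addr add 10.0.0.2/24 dev veth-r
ip netns exec blue ip link set veth-b up
ip netns exec red ip link set veth-r up
ip netns exec blue ping -c 3 10.0.0.2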

Physics: You built a virtual network that exists entirely inside kernel memory.

Exercise 3: UID Mapping (Advanced)

Task: Use unshare --user --map-root-user.
Action: Run id. It says uid=0.
Action: Run touch /tmp/test. Check file ownership from another terminal.
Observation: It is owned by your real user ID, not root.


Knowledge Check

  1. What syscall flag creates a new network namespace?
  2. Does a namespace copy the parent’s resources or hide them?
  3. What connects a container networking stack to the host?
  4. Where are namespace handles exposed in the filesystem?
  5. Why is CLONE_NEWNS named so generically?

Answers

  1. CLONE_NEWNET.
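  2. Hides them. A new network or PID namespace starts with a fresh, nearly empty view; a new mount or UTS namespace starts as a private copy of the parent's state.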
  3. Veth Pair. (Virtual Ethernet).
  4. /proc/[pid]/ns/.
  5. History. It was the first namespace (Mount Namespace) in 2002. They didn’t know there would be more.

Summary

  • Namespaces: Visibility Barriers.
  • nsproxy: The pointer to isolation.
  • Veth: The bridge across worlds.
  • Container: clone(FLAGS).
