The Physics of Cgroups: Resource Controllers & OOM

How Docker limits RAM. The physics of the CFS Scheduler, Hierarchical Token Buckets, and the OOM Killer's linear search.

Beginner · 40 min read

🎯 What You'll Learn

  • Deconstruct the Cgroup V2 Unified Hierarchy
  • Calculate CPU Shares vs Quotas (Physics of Throttling)
  • Trace the OOM Killer's decision path (oom_score)
  • Implement IO Throttling (Read/Write BPS)
  • Debug 'Throttled Time' in production containers

Introduction

In the old days, a single runaway process could crash a server by eating all 64GB of RAM. Today, we run 500 containers on that same server. How? Control Groups (Cgroups).

Cgroups are not just “limits.” They are Accounting Ledgers built deep into the Kernel’s core subsystems (Scheduler, Memory Manager, Block Layer). When the scheduler picks a task, it doesn’t just check priority; it checks the Cgroup Quota.

This lesson explores the physics of these limits: from CPU throttling to the dreaded OOM termination.


Cgroup V2: The Unified Hierarchy

History Check:

  • V1 (Legacy): CPU, Memory, and IO had separate directory trees. It was a mess.
  • V2 (Modern): One tree to rule them all. /sys/fs/cgroup.

When you create a Cgroup, you create a directory. Inside that directory, you manipulate files to talk to the kernel controllers.

The Physics of Controllers

Every Cgroup has a cgroup.controllers file. You write +memory or +cpu to cgroup.subtree_control to enable features for child groups. This activates the “Accounting Hooks” in the kernel. Latency Note: Enabling memory accounting adds a tiny CPU overhead to every memory allocation (updating counters).
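As a concrete sketch (assuming a v2 mount at /sys/fs/cgroup, the default on modern systemd hosts):

```shell
# See which controllers the kernel offers in this group
cat /sys/fs/cgroup/cgroup.controllers

# Delegate cpu and memory to child groups; a child cannot use a
# controller until the parent enables it in subtree_control.
echo "+cpu +memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
```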


CPU Controller: Shares vs Quota

There are two ways to control CPU, and they use completely different physics.

1. CPU Shares (cpu.weight)

  • Mechanism: Proportional Weight.
  • Physics: If Group A has 100 shares and Group B has 100 shares, and both want 100% CPU, they get 50% each.
  • Key: If Group B is idle, Group A can use 100%. It is work-conserving. No CPU cycles are wasted.
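A minimal sketch of proportional weights (the group names a and b are hypothetical; the default weight is 100, valid range 1–10000):

```shell
sudo mkdir -p /sys/fs/cgroup/a /sys/fs/cgroup/b

echo 200 | sudo tee /sys/fs/cgroup/a/cpu.weight   # 2 parts
echo 100 | sudo tee /sys/fs/cgroup/b/cpu.weight   # 1 part

# Under full contention: a gets ~200/300 = 66% of CPU, b gets ~33%.
# If b goes idle, a can burst to 100% -- weights only bite under contention.
```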

2. CPU Quota (cpu.max)

  • Mechanism: Hard Wall.
  • Physics: “You get 50ms of runtime every 100ms period.”
  • Danger: If you burn your 50ms in the first 10ms, your process is Throttled (paused) for the remaining 90ms. This causes massive tail latency.
  • Metric: Check nr_throttled in cpu.stat.
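A sketch of setting a hard quota on a hypothetical group named sandbox, and how to spot throttling:

```shell
# Quota syntax is "<max-usec> <period-usec>": 50ms of CPU every 100ms period
echo "50000 100000" | sudo tee /sys/fs/cgroup/sandbox/cpu.max

# nr_throttled:   how many periods hit the wall
# throttled_usec: total time tasks sat frozen waiting for the next period
grep -E 'nr_throttled|throttled_usec' /sys/fs/cgroup/sandbox/cpu.stat
```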

Memory Controller: The OOM Killer

Memory limits are binary: an allocation either fits under the line, or it doesn’t.

1. The Limit (memory.max)

If a process tries to malloc past this line:

  1. Kernel scans for reclaimable memory (Caches).
  2. If it can free cache, it does.
  3. If it cannot: OOM Killer.
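A sketch of the two memory lines on a hypothetical group named sandbox: memory.high is a soft pressure line that triggers reclaim, memory.max is the hard OOM line described above.

```shell
# Soft line: past 48MB, the kernel aggressively reclaims and throttles
echo $((48 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/sandbox/memory.high

# Hard line: past 64MB, with nothing left to reclaim, the OOM Killer runs
echo $((64 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/sandbox/memory.max

# Post-mortem counters: look for the oom_kill line
cat /sys/fs/cgroup/sandbox/memory.events
```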

2. The Physics of OOM

When the limit is hit, the kernel pauses the allocating process. It calculates an oom_score for every process in that cgroup.

  • Algorithm: roughly Usage / Total Limit (memory consumed relative to the cgroup’s limit), shifted by each process’s oom_score_adj.
  • Victim Selection: The process with the highest score dies.
  • Action: SIGKILL (Force Kill).

Protection: critical processes can set oom_score_adj to -1000, which exempts them from OOM selection entirely (the closest thing the kernel offers to immortality).
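A sketch of the protection knob; it lives per-process in procfs, and $$ here is the current shell:

```shell
# -1000 exempts a process from OOM selection entirely;
# +1000 makes it the preferred victim.
echo -1000 | sudo tee /proc/$$/oom_score_adj

# The kernel's current badness score (higher = more likely to be killed)
cat /proc/$$/oom_score
```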


IO Controller: Bandwidth Shaping

Block IO (Disk) is shared. Cgroups use a mechanism usually called a Token Bucket Filter.

  • io.max: “Read 10MB/s”.
  • If you read 20MB in one second, the kernel delays your subsequent IO until the average rate drops back to 10MB/s.
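A sketch of io.max on a hypothetical group named sandbox. The device numbers below are examples; rbps/wbps are bytes per second, and riops/wiops cap operations per second:

```shell
# Find the MAJ:MIN pair for the target block device
lsblk -o NAME,MAJ:MIN

# Cap the group to 10MB/s reads and 5MB/s writes on device 259:0
echo "259:0 rbps=10485760 wbps=5242880" | sudo tee /sys/fs/cgroup/sandbox/io.max
```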

Code: Creating a Cgroup Manually

Let’s do what Docker does, but raw.

# 0. Make sure the memory controller is delegated to child groups
#    (usually already enabled on systemd hosts)
echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# 1. Create a new Cgroup
sudo mkdir /sys/fs/cgroup/sandbox

# 2. Set a Memory Limit (10MB = 10 * 1024 * 1024 bytes)
echo 10485760 | sudo tee /sys/fs/cgroup/sandbox/memory.max

# 3. Add current shell to the Cgroup
# CAUTION: Your shell is now severely limited!
echo $$ | sudo tee /sys/fs/cgroup/sandbox/cgroup.procs

# 4. Verify
cat /proc/self/cgroup
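A cgroup directory can only be removed once it has no member processes, so the teardown (a sketch) is:

```shell
# Move the shell back to the root cgroup...
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs

# ...then delete the now-empty directory
sudo rmdir /sys/fs/cgroup/sandbox
```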

Practice Exercises

Exercise 1: The OOM (Beginner)

Task: Create a cgroup limited to 5MB. Action: Run python3 -c "x = 'a' * 10 * 1024 * 1024". Result: “Killed”. Check dmesg to see the OOM Killer’s murder report (“Memory cgroup out of memory”).

Exercise 2: CPU Throttling (Intermediate)

Task: Limit cgroup to 10% CPU (cpu.max = 10000 100000). Action: Run while :; do :; done. Observation: It runs, but top shows it stuck at exactly 10%. Checking cpu.stat shows massive throttling counts.

Exercise 3: Hierarchy (Advanced)

Task: Create /sys/fs/cgroup/parent and /sys/fs/cgroup/parent/child. Action: Enable memory controller in parent. Physics: Note how constraints flow down. Limits in parent cap the child, even if child has a higher limit set.
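A sketch of the exercise (assuming memory is already delegated from the root), showing the top-down constraint:

```shell
sudo mkdir -p /sys/fs/cgroup/parent/child
echo "+memory" | sudo tee /sys/fs/cgroup/parent/cgroup.subtree_control

echo $((10 * 1024 * 1024))  | sudo tee /sys/fs/cgroup/parent/memory.max
echo $((100 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/parent/child/memory.max

# The child's effective ceiling is still 10MB: the kernel takes the
# minimum of every memory.max on the path up to the root.
```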


Knowledge Check

  1. What happens to unused CPU cycles in cpu.weight mode?
  2. Does memory.max cause throttling or killing?
  3. Why is Cgroup V2 preferred over V1?
  4. How do you protect a critical process from OOM Kill?
  5. Which file do you write a PID into to add a process to a cgroup?
Answers
  1. They are distributed. Neighbors can use them.
  2. Killing (OOM). Though reclaim happens first.
  3. Unified Hierarchy. Prevents conflict and race conditions between controllers.
  4. oom_score_adj. Set it to -1000.
  5. cgroup.procs.

Summary

  • CPU: Shares (Soft) vs Quota (Hard).
  • Memory: Max Limit = OOM Risk.
  • V2: Unified Tree.
  • Physics: Kernel Accounting Hooks on every alloc/sched event.

Questions about this lesson? Working on related infrastructure?

Let's discuss