Load Balancing: The Physics of Queues

Why adding servers doesn't always make things faster. Little's Law, the Thundering Herd, and Layer 7 Traffic Shaping.

Beginner · 40 min read

🎯 What You'll Learn

  • Apply Little's Law ($L = \lambda W$) to system capacity
  • Differentiate L4 (Packet) vs L7 (Request) Load Balancing
  • Mitigate the 'Thundering Herd' problem
  • Configure Nginx Upstream blocks
  • Analyze Sticky Sessions vs Stateless Routing

Introduction

A Load Balancer is not just a traffic cop. It is a Queue Manager.

Every server is a queue.

  • The CPU has a Run Queue.
  • The Network Card has a Ring Buffer.
  • The Database has a Lock Queue.

If you understand Queueing Theory, you understand Load Balancing. If you don’t, you add servers until you go bankrupt.


The Physics: Little’s Law

The fundamental law of system capacity is:

$L = \lambda W$

  • $L$: Average number of items in the system (Queue Length).
  • $\lambda$: Average arrival rate (Requests per Second).
  • $W$: Average wait time (Latency).

The Insight: If Latency ($W$) doubles (e.g., the Database slows down), then Queue Length ($L$) doubles, even if traffic ($\lambda$) stays constant. Your Load Balancer’s job is to detect this and stop sending requests to the slow server before it crashes.
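The arithmetic is worth making concrete. A minimal sketch (the numbers are purely illustrative):

```python
# Little's Law: L = lambda * W
arrival_rate = 200   # requests/sec (lambda)
latency = 0.25       # average seconds per request (W)

in_flight = arrival_rate * latency
print(in_flight)  # 50.0 concurrent requests in the system

# The database slows down: latency doubles, traffic unchanged.
in_flight_slow = arrival_rate * (2 * latency)
print(in_flight_slow)  # 100.0 -- the queue length doubles too
```

Note that nothing about the traffic changed; the queue grew purely because each request lingers longer.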


L4 vs. L7: The Layers of Traffic

How deep does the Load Balancer look?

Layer 4 (Transport)

  • What it sees: IP + Port. “Packet from 1.2.3.4 to 5.6.7.8”.
  • Action: Forwards packets. Fast. Dumb.
  • Example: LVS, Maglev.

Layer 7 (Application)

  • What it sees: HTTP Headers, Cookies, URL. “GET /api/user?id=5”.
  • Action: Terminates TCP, reads request, opens new TCP to backend. Smart. Slow.
  • Example: Nginx, HAProxy, AWS ALB.
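Since configuring an Nginx upstream block is one of the goals of this lesson, here is a sketch of what an L7 setup looks like (hostnames, ports, and weights are hypothetical):

```nginx
upstream backend {
    # Weighted round robin: the heavy box gets 2x the traffic
    server srv_heavy.internal:8080 weight=2;
    server srv_a.internal:8080 max_fails=3 fail_timeout=30s;
    server srv_b.internal:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://backend;
    }
}
```

`weight` biases the round robin; `max_fails` and `fail_timeout` form the passive health check that stops sending traffic to a dying server.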

Deep Dive: The Thundering Herd

Imagine 10,000 users are waiting for a cache entry. The entry expires. Suddenly, 10,000 requests hit the Backend DB simultaneously. The DB crashes. The LB retries. The DB stays dead.

Solution: Shepherd the Herd.

  1. Request Coalescing: The LB holds 9,999 requests, sends one to the backend, and serves the result to all 10,000.
  2. Jitter: Process requests (or retries) with slight random delays to desynchronize spikes.
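Request coalescing (sometimes called "single-flight") fits in a few lines. The `Coalescer` class and `fetch` callback below are an illustrative sketch, not a real load-balancer API:

```python
import threading

class Coalescer:
    """Single-flight: concurrent requests for the same key share one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> (done_event, result_holder)

    def get(self, key, fetch):
        with self._lock:
            if key in self._in_flight:          # someone is already fetching this key
                done, holder = self._in_flight[key]
                leader = False
            else:                               # we become the leader for this key
                done, holder = threading.Event(), {}
                self._in_flight[key] = (done, holder)
                leader = True

        if leader:
            try:
                holder["value"] = fetch(key)    # the ONE request that hits the backend
            finally:
                with self._lock:
                    del self._in_flight[key]
                done.set()                      # wake all the waiting followers
        else:
            done.wait()                         # followers wait instead of stampeding

        return holder.get("value")
```

One backend call serves every concurrent caller of the same key; the herd never reaches the database.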

Code: Weighted Round Robin

A simple Round Robin is often not enough. You need Weights.

class WeightedRR:
    def __init__(self, servers):
        # servers = {"srv1": 5, "srv2": 1, "srv3": 1}
        self.servers = servers
        self.state = {k: 0 for k in servers}

    def get_server(self):
        # Add each server's weight to its running score, pick the highest,
        # then subtract the total (a simplified Nginx-like "smooth" WRR)
        best = None
        total_weight = 0
        
        for srv, weight in self.servers.items():
            self.state[srv] += weight
            total_weight += weight
            
            if best is None or self.state[srv] > self.state[best]:
                best = srv
        
        self.state[best] -= total_weight
        return best

# Physics: This ensures "Smooth" distribution, not "Bursty" distribution.
# Srv1 doesn't get 5 requests in a row; with weights {srv1: 5, srv2: 1, srv3: 1}
# the sequence is srv1, srv1, srv2, srv1, srv3, srv1, srv1, then it repeats.

Practice Exercises

Exercise 1: Capacity Planning (Beginner)

Scenario:

  • You process 1000 Req/Sec ($\lambda$).
  • Avg Latency is 0.5 Sec ($W$).

Task: According to Little’s Law, how many concurrent connections ($L$) must your server support?

Exercise 2: Nginx Config (Intermediate)

Task: Configure an Nginx upstream block that:

  1. Load balances 3 servers.
  2. Sends 2x traffic to srv_heavy.
  3. Marks a server “down” if it fails 3 times.

Exercise 3: Sticky Sessions (Advanced)

Scenario: A user logs in on Server A. Their session is in Server A’s RAM.

Task: Why does Round Robin break this? How does ip_hash fix it? What is the downside of ip_hash during a DDoS attack?


Knowledge Check

  1. What happens to system capacity if latency increases?
  2. Why is L7 Load Balancing slower than L4?
  3. What is the “Thundering Herd” problem?
  4. Why do we need health checks?
  5. Does adding more servers always fix high latency?
Answers
  1. Capacity drops (or Queue Length explodes). $L = \lambda W$: if $W$ goes up, $L$ goes up.
  2. Protocol processing. L7 requires terminating the TCP connection, buffering the request, parsing headers, and creating a new connection. L4 just rewrites packets.
  3. Massive concurrency. When a cached entry expires, all the requests that were being served from it hit the DB at once.
  4. To avoid Black Holes. Sending traffic to a dead server results in 100% error rates.
  5. No. If the bottleneck is the Database (shared resource), adding more Web Servers just increases the queue pressure on the DB.

Summary

  • Little’s Law: Latency kills throughput.
  • Algorithms: Use Weighted Round Robin for heterogeneity.
  • Layers: L4 for speed, L7 for intelligence.

Questions about this lesson? Working on related infrastructure?

Let's discuss