Distributed Systems: The Physics of Information

Why 'now' doesn't exist. Coping with the speed of light, network partitions, and the CAP Theorem.

Beginner 45 min read Expert Version →

🎯 What You'll Learn

  • Deconstruct the CAP Theorem as a physical law
  • Implement Logical Clocks (Lamport/Vector)
  • Visualize Consistent Hashing Rings
  • Analyze Quorum Consensus (R + W > N)
  • Trace a Network Partition failure mode

📚 Prerequisites

Before this lesson, you should understand:

Introduction

In a single computer, time is absolute (the CPU clock). In a distributed system, Time is a partial ordering of events.

This is not only an engineering problem; it is a relavistic one. Information cannot travel faster than the speed of light (or fiber optics). Therefore, Instant Consistency is physically impossible over distance.

In this lesson, we learn how to survive in a universe where machines lie, networks fail, and clocks drift.


The Physics: CAP Theorem

The CAP Theorem is often misunderstood as “Pick 2”. It is actually: “In the event of a Partition (P), do you choose Consistency (C) or Availability (A)?”

The Split Brain

Imagine a database with two nodes: N1N_1 (US) and N2N_2 (EU). The cable between them is cut.

  • User writes “X = 1” to N1N_1.
  • User reads “X” from N2N_2.

The Choice:

  1. Availability (AP): N2N_2 returns “X = 0” (Old data). The system works, but lies.
  2. Consistency (CP): N2N_2 returns “Error”. The system is down, but truthful.

Bitcon and Ethereum are AP (Chain forks temporarily). Traditional Banks are CP (ATMs stop working).


Deep Dive: Time & Consensus

Since wall-clocks drift (NTP is only accurate to ms), we cannot rely on timestamps to order events. We use Logical Clocks.

Lamport Clocks

Every time a process does something, it increments a counter TT. When it sends a message, it includes TT. Is process P2P_2 receives a message with T=5T=5 and its own clock is T=3T=3, it updates its clock to max(3,5)+1=6max(3, 5) + 1 = 6. Physics: This ensures causality. Effect follows Cause.

Consistent Hashing (The Ring)

How do you distribute keys across 100 servers so that adding 1 server doesn’t require moving all data? Map both Servers and Keys to a Circle (0-360°). Key KK is stored on the first Server SS encountered moving clockwise.


Code: The Hash Ring

import hashlib

class ConsistentHash:
    def __init__(self, replicas=3):
        self.replicas = replicas
        self.ring = {}
        self.sorted_keys = []

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            key = self._hash(f"{node}:{i}")
            self.ring[key] = node
            self.sorted_keys.append(key)
        self.sorted_keys.sort()

    def get_node(self, resource):
        if not self.ring: return None
        key = self._hash(resource)
        
        # Find first node to right on ring
        for node_key in self.sorted_keys:
            if key <= node_key:
                return self.ring[node_key]
        
        # Wrap around
        return self.ring[self.sorted_keys[0]]

Practice Exercises

Exercise 1: The Quorum Formula (Beginner)

Scenario: You have 5 replicas (N=5N=5). Task: To ensure Strong Consistency, how many nodes must you Write to (WW) and Read from (RR)? (Hint: R+W>NR + W > N).

Exercise 2: Vector Clocks (Intermediate)

Scenario:

  • Node A: [1, 0, 0]
  • Node B: [0, 1, 0]
  • Node A sends message to B. Task: What is the vector clock of B after receiving the message?

Exercise 3: Partition Recovery (Advanced)

Scenario: An AP system (Cassandra) heals after a split. Node A has “X=1”, Node B has “X=2”. Task: How does it resolve the conflict? (Last-Write-Wins vs. Vector Clock merging).


Knowledge Check

  1. Why is NTP (Network Time Protocol) insufficient for database ordering?
  2. What does “Eventual Consistency” really mean?
  3. In a Consistent Hash Ring, what happens when a node crashes?
  4. Why do blockchains favor Availability (AP) over Consistency (CP)?
  5. What is a “Split Brain” scenario?
Answers
  1. Drift & Uncertainty. NTP can be off by milliseconds. In high-frequency systems, this causes data corruption.
  2. Convergent State. If writes stop, all nodes will eventually agree. Until then, you may read stale data.
  3. Neighbor takeover. Only the keys belonging to the crashed node are remapped to its clockwise neighbor.
  4. Liveness guarantee. It is better to accept payments and risk a reorg (rare) than to halt the global economy whenever a wire is cut.
  5. Two leaders. When a cluster splits, both sides elect a leader and accept writes, creating diverging histories that are hard to merge.

Summary

  • CAP: Physics limit. Choose P + (A or C).
  • Time: Use Logical Clocks, not Wall Clocks.
  • Distribution: Consistent Hashing minimizes reshuffling.

Questions about this lesson? Working on related infrastructure?

Let's discuss