Distributed Systems: The Physics of Information
Why 'now' doesn't exist. Coping with the speed of light, network partitions, and the CAP Theorem.
🎯 What You'll Learn
- Deconstruct the CAP Theorem as a physical law
- Implement Logical Clocks (Lamport/Vector)
- Visualize Consistent Hashing Rings
- Analyze Quorum Consensus (R + W > N)
- Trace a Network Partition failure mode
📚 Prerequisites
Before this lesson, you should understand:
Introduction
In a single computer, time is absolute (the CPU clock). In a distributed system, Time is a partial ordering of events.
This is not only an engineering problem; it is a relavistic one. Information cannot travel faster than the speed of light (or fiber optics). Therefore, Instant Consistency is physically impossible over distance.
In this lesson, we learn how to survive in a universe where machines lie, networks fail, and clocks drift.
The Physics: CAP Theorem
The CAP Theorem is often misunderstood as “Pick 2”. It is actually: “In the event of a Partition (P), do you choose Consistency (C) or Availability (A)?”
The Split Brain
Imagine a database with two nodes: (US) and (EU). The cable between them is cut.
- User writes “X = 1” to .
- User reads “X” from .
The Choice:
- Availability (AP): returns “X = 0” (Old data). The system works, but lies.
- Consistency (CP): returns “Error”. The system is down, but truthful.
Bitcon and Ethereum are AP (Chain forks temporarily). Traditional Banks are CP (ATMs stop working).
Deep Dive: Time & Consensus
Since wall-clocks drift (NTP is only accurate to ms), we cannot rely on timestamps to order events. We use Logical Clocks.
Lamport Clocks
Every time a process does something, it increments a counter . When it sends a message, it includes . Is process receives a message with and its own clock is , it updates its clock to . Physics: This ensures causality. Effect follows Cause.
Consistent Hashing (The Ring)
How do you distribute keys across 100 servers so that adding 1 server doesn’t require moving all data? Map both Servers and Keys to a Circle (0-360°). Key is stored on the first Server encountered moving clockwise.
Code: The Hash Ring
import hashlib
class ConsistentHash:
def __init__(self, replicas=3):
self.replicas = replicas
self.ring = {}
self.sorted_keys = []
def _hash(self, key):
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def add_node(self, node):
for i in range(self.replicas):
key = self._hash(f"{node}:{i}")
self.ring[key] = node
self.sorted_keys.append(key)
self.sorted_keys.sort()
def get_node(self, resource):
if not self.ring: return None
key = self._hash(resource)
# Find first node to right on ring
for node_key in self.sorted_keys:
if key <= node_key:
return self.ring[node_key]
# Wrap around
return self.ring[self.sorted_keys[0]]
Practice Exercises
Exercise 1: The Quorum Formula (Beginner)
Scenario: You have 5 replicas (). Task: To ensure Strong Consistency, how many nodes must you Write to () and Read from ()? (Hint: ).
Exercise 2: Vector Clocks (Intermediate)
Scenario:
- Node A: [1, 0, 0]
- Node B: [0, 1, 0]
- Node A sends message to B. Task: What is the vector clock of B after receiving the message?
Exercise 3: Partition Recovery (Advanced)
Scenario: An AP system (Cassandra) heals after a split. Node A has “X=1”, Node B has “X=2”. Task: How does it resolve the conflict? (Last-Write-Wins vs. Vector Clock merging).
Knowledge Check
- Why is NTP (Network Time Protocol) insufficient for database ordering?
- What does “Eventual Consistency” really mean?
- In a Consistent Hash Ring, what happens when a node crashes?
- Why do blockchains favor Availability (AP) over Consistency (CP)?
- What is a “Split Brain” scenario?
Answers
- Drift & Uncertainty. NTP can be off by milliseconds. In high-frequency systems, this causes data corruption.
- Convergent State. If writes stop, all nodes will eventually agree. Until then, you may read stale data.
- Neighbor takeover. Only the keys belonging to the crashed node are remapped to its clockwise neighbor.
- Liveness guarantee. It is better to accept payments and risk a reorg (rare) than to halt the global economy whenever a wire is cut.
- Two leaders. When a cluster splits, both sides elect a leader and accept writes, creating diverging histories that are hard to merge.
Summary
- CAP: Physics limit. Choose P + (A or C).
- Time: Use Logical Clocks, not Wall Clocks.
- Distribution: Consistent Hashing minimizes reshuffling.
Questions about this lesson? Working on related infrastructure?
Let's discuss