The Physics of Nodes: State Tries & Pruning
Why Archive Nodes are 15TB. The physics of Merkle Patricia Tries, State Pruning, and how Light Clients verify data without storing it.
🎯 What You'll Learn
- Deconstruct the Merkle Patricia Trie (MPT)
- Analyze State Pruning (Full Node vs Archive Node)
- Trace a Light Client Verification (Merkle Root)
- Calculate the Storage Cost of State Growth
- Compare Geth (Go) vs Reth (Rust) Architecture
📚 Prerequisites
Before this lesson, you should understand:
Introduction
A Blockchain is a State Machine, not a ledger. The “State” is the current balance of every account and the value of every smart contract variable. To run a node, you don’t just download blocks. You must recompute the State by executing every transaction from Genesis.
This lesson explores the physics of data storage in Decentralized Systems.
The Physics: Merkle Patricia Trie (MPT)
How do you store 200 Million accounts so that any balance can be verified in $O(\log n)$ time? The Merkle Patricia Trie.
It is a tree structure where the Root Hash fingerprints the entire state of the universe.
- Key: keccak256(address)
- Value: RLP(Nonce, Balance, StorageRoot, CodeHash)
Physics: If you change 1 bit of storage in a smart contract:
- The Storage Trie Root changes.
- The Account State changes.
- The Global State Root changes.
This “Avalanche Effect” ensures integrity but requires massive I/O to update the disk.
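The propagation described above can be sketched with a toy model. This is an illustration only, under loud assumptions: `hashlib.sha3_256` stands in for Keccak-256 (the two produce different digests), the "tries" are flat hashes over sorted entries rather than real Merkle Patricia Tries, and fixed-width byte packing replaces RLP. The helper names (`storage_root`, `account_hash`, `state_root`) are hypothetical.

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in hash: NIST SHA3-256, NOT Ethereum's Keccak-256.
    return hashlib.sha3_256(data).digest()

def storage_root(slots: dict) -> bytes:
    # Toy "storage trie": one hash over the sorted (slot, value) pairs.
    return h(b"".join(h(key + slots[key]) for key in sorted(slots)))

def account_hash(nonce: int, balance: int, slots: dict, code: bytes) -> bytes:
    # Toy account record (Nonce, Balance, StorageRoot, CodeHash).
    # A real node RLP-encodes these fields; here they are simply concatenated.
    record = (nonce.to_bytes(8, "big") + balance.to_bytes(32, "big")
              + storage_root(slots) + h(code))
    return h(record)

def state_root(accounts: dict) -> bytes:
    # Toy "state trie": one hash over sorted (hashed address, account) pairs.
    return h(b"".join(h(addr) + accounts[addr] for addr in sorted(accounts)))

slots = {b"\x00": b"\x01"}
accounts = {
    b"contract": account_hash(1, 0, slots, b"code"),
    b"alice": account_hash(0, 10, {}, b""),
}
alice_before = accounts[b"alice"]
root_before = state_root(accounts)

# Flip a single bit of contract storage (0x01 -> 0x00) and recompute upward.
slots[b"\x00"] = b"\x00"
accounts[b"contract"] = account_hash(1, 0, slots, b"code")
root_after = state_root(accounts)

print("state root changed:", root_before != root_after)
print("alice untouched:", accounts[b"alice"] == alice_before)
```

Only the hashes on the path from the modified slot to the root are recomputed; unrelated accounts keep their hashes. In a real trie each recomputed node is a separate database write, which is where the I/O cost comes from.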
Deep Dive: Full vs Archive (The Pruning)
Why is an Archive Node 15TB but a Full Node only 1TB? Pruning.
- Archive Node: Stores the State Trie for every block from Genesis.
- “What was Vitalik’s balance at Block 5,000,000?” -> Query the Trie at that block’s State Root.
- Full Node: Stores only the Current State Trie (Head) and usually the last 128 blocks.
- “What was Vitalik’s balance at Block 5,000,000?” -> Error: Missing Trie Node.
Physics: Both nodes verify every transaction. A Full Node just deletes the old data after verification to save space.
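The difference between the two node types can be sketched as a retention policy. This is a toy model, not how any client lays out its database: state is a plain dictionary snapshot per block, and the `ToyNode` class and `MissingTrieNode` exception are hypothetical names. The 128-block window is the figure from the text above.

```python
RETAINED_BLOCKS = 128  # retention window quoted in the text above

class MissingTrieNode(Exception):
    """Raised when the state for a requested block has been pruned."""

class ToyNode:
    def __init__(self, archive: bool):
        self.archive = archive
        self.states = {}  # block number -> snapshot of balances
        self.head = -1

    def import_block(self, number: int, balances: dict) -> None:
        # Both node types process every block and record the resulting state.
        self.states[number] = dict(balances)
        self.head = number
        if not self.archive:
            # A full node drops everything older than the retention window.
            cutoff = number - RETAINED_BLOCKS
            for old in [n for n in self.states if n <= cutoff]:
                del self.states[old]

    def balance_at(self, number: int, account: str) -> int:
        if number not in self.states:
            raise MissingTrieNode(f"state for block {number} was pruned")
        return self.states[number][account]

full = ToyNode(archive=False)
archive = ToyNode(archive=True)
for block in range(1000):
    state = {"vitalik": block}  # toy balance that changes every block
    full.import_block(block, state)
    archive.import_block(block, state)

print(archive.balance_at(5, "vitalik"))   # historical query succeeds
print(full.balance_at(999, "vitalik"))    # head query succeeds
try:
    full.balance_at(5, "vitalik")
except MissingTrieNode as err:
    print("full node:", err)
```

Both nodes ran the same 1,000 blocks; only the amount of retained state differs.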
Strategy: Light Clients (Trustless Verification)
How can a mobile phone verify a transaction without downloading 1TB? Merkle Proofs.
- Phone: “Hey Full Node, did Account A send 5 ETH?”
- Full Node: “Yes. Here is the transaction, and here is the ‘Branch’ of hashes leading to the Block Header.”
- Phone: Hashes the branch. If the result matches the Block Header (which the phone trusts via PoW/PoS), the data must be true.
Physics: The Light Client downloads only block headers (~10MB). A Merkle proof gives it cryptographic certainty that the data is included under a header’s root, without the client ever storing the underlying data itself.
Architecture: Geth vs Reth
- Geth (Go): Uses LevelDB/Pebble. Single-threaded execution (historically). Heavy on RAM.
- Reth (Rust): Uses MDBX. Staged Sync (a pipeline design pioneered by Erigon). Massive parallelism in I/O.
- Result: Reth syncs an Archive Node in 50 hours (vs weeks for Geth).
Code: Verifying a Merkle Proof
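import hashlib

def keccak256(data: bytes) -> bytes:
    # Stand-in only: hashlib.sha3_256 is NIST SHA3-256, which uses different
    # padding from Ethereum's Keccak-256 and yields different digests.
    # Swap in a real Keccak-256 implementation to check on-chain data.
    return hashlib.sha3_256(data).digest()

def verify_proof(root_hash: bytes, target_data: bytes, proof_path) -> bool:
    """Check a binary Merkle proof given as a list of (sibling_hash, direction)."""
    current_hash = keccak256(target_data)
    for sibling_hash, direction in proof_path:
        if direction == 'left':
            # Sibling sits on the left: Hash(Sibling + Current)
            current_hash = keccak256(sibling_hash + current_hash)
        else:
            # Sibling sits on the right: Hash(Current + Sibling)
            current_hash = keccak256(current_hash + sibling_hash)
    return current_hash == root_hash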
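A usage sketch for the verifier above, written to be self-contained, so it restates a compact version of `verify_proof`. The assumptions are: a plain binary Merkle tree over four toy "transactions" (Ethereum's real tries are Merkle Patricia Tries with a different proof format), SHA3-256 as a stand-in for Keccak-256, and a hypothetical helper `build_proof` that plays the role of the Full Node.

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in hash: NIST SHA3-256, NOT Ethereum's Keccak-256.
    return hashlib.sha3_256(data).digest()

def verify_proof(root_hash, target_data, proof_path):
    # Fold the sibling hashes into the leaf hash, bottom to top.
    current = h(target_data)
    for sibling, direction in proof_path:
        if direction == "left":
            current = h(sibling + current)
        else:
            current = h(current + sibling)
    return current == root_hash

def build_proof(leaves, index):
    """Return (root, proof) for leaves[index].

    Assumes len(leaves) is a power of two to keep the sketch short.
    """
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        sibling_index = index ^ 1  # neighbour within the same pair
        direction = "left" if sibling_index < index else "right"
        proof.append((level[sibling_index], direction))
        # Hash each adjacent pair to form the next level up.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof

# "Full Node" side: owns all the data, produces root and branch.
txs = [b"tx0", b"tx1", b"tx2", b"tx3"]
root, proof = build_proof(txs, 2)

# "Phone" side: holds only the root, receives the item and the branch.
ok = verify_proof(root, b"tx2", proof)
forged = verify_proof(root, b"tx2-forged", proof)
print(ok, forged, len(proof))
```

The proof for four leaves contains two sibling hashes; for a tree of n leaves it grows as log2(n), which is why the phone never needs the full dataset.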
Practice Exercises
Exercise 1: State Growth (Beginner)
Scenario: 1 Million new accounts created per day. Each account is 100 bytes. Task: Calculate monthly state growth. (1,000,000 × 100 bytes × 30 days = 3 GB). But overhead (Trie Nodes) makes it larger.
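The arithmetic for this exercise can be checked with a few lines. The 30-day month and the `overhead_factor` parameter are assumptions added for illustration; the exercise gives no figure for trie overhead, so the default leaves it at 1.0 (raw account data only).

```python
def monthly_growth_gb(accounts_per_day: int,
                      bytes_per_account: int,
                      days: int = 30,
                      overhead_factor: float = 1.0) -> float:
    """Estimated state growth per month in gigabytes (1 GB = 10**9 bytes).

    overhead_factor is a hypothetical multiplier for trie-node overhead.
    """
    raw_bytes = accounts_per_day * bytes_per_account * days
    return raw_bytes * overhead_factor / 1e9

raw_gb = monthly_growth_gb(1_000_000, 100)
print(f"raw account data: {raw_gb} GB per month")
```

Passing a larger `overhead_factor` shows how quickly the estimate grows once intermediate trie nodes are counted.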
Exercise 2: Pruning (Intermediate)
Scenario: You run a node with --gcmode=full.
Task: You try to query eth_call on a block from last year.
Result: “Error: computations for old blocks are not implemented.” (The state is gone).
Exercise 3: Sync Time (Advanced)
Scenario: SSD IOPS = 10,000. Total Trie Nodes = 1 Billion. Task: Estimate Sync Time. (Requires random reads, one per Trie Node. 1B / 10k = 100,000 seconds = ~28 hours).
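A small sketch of the same estimate, assuming one random read per trie node and no caching or batching. Real clients do both, so treat the result as a simplified bound for the exercise rather than a measured figure.

```python
def sync_time_hours(trie_nodes: int, iops: int) -> float:
    """Hours needed if every trie node costs exactly one random read."""
    seconds = trie_nodes / iops
    return seconds / 3600

hours = sync_time_hours(1_000_000_000, 10_000)
print(f"{hours:.1f} hours")
```

Because the estimate is inversely proportional to IOPS, a drive with ten times the random-read throughput cuts the figure to a tenth.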
Knowledge Check
- What is the difference between a Full Node and an Archive Node?
- What does a Merkle Root represent?
- Why do Light Clients trust Block Headers?
- Why is I/O the bottleneck for nodes?
- What is “State Pruning”?
Answers
- History. Archive keeps all historical states; Full keeps only the tip.
- Summary. A cryptographic fingerprint of the entire dataset.
- PoS/PoW. The cost to fake a header is billions of dollars (Consensus).
- Random Access. Updating the Trie requires jumping around the disk (random writes).
- Deletion. Removing old Merkle Nodes that are no longer referenced by the HEAD block.
Summary
- State: Stored in Tries.
- Archive: Stores all Tries.
- Light: Verifies Roots.