The Physics of Nodes: State Tries & Pruning

Why Archive Nodes are 15TB. The physics of Merkle Patricia Tries, State Pruning, and how Light Clients verify data without storing it.

Intermediate · 40 min read

🎯 What You'll Learn

  • Deconstruct the Merkle Patricia Trie (MPT)
  • Analyze State Pruning (Full Node vs Archive Node)
  • Trace a Light Client Verification (Merkle Root)
  • Calculate the Storage Cost of State Growth
  • Compare Geth (Go) vs Reth (Rust) Architecture

📚 Prerequisites

Before this lesson, you should understand:

Introduction

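A Blockchain is a State Machine, not just a ledger. The “State” is the current balance of every account and the value of every smart contract variable. To run a fully verifying node, downloading blocks is not enough: you must recompute the State by executing every transaction from Genesis. (Snapshot-based sync modes shortcut this by downloading a recent state and checking it against the state root.)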

This lesson explores the physics of data storage in Decentralized Systems.


The Physics: Merkle Patricia Trie (MPT)

How do you store 200 Million accounts so that any balance can be verified in $O(\log n)$? The Merkle Patricia Trie.

It is a tree structure where the Root Hash fingerprints the entire state of the universe.

  • Key: keccak256(address)
  • Value: RLP(Nonce, Balance, StorageRoot, CodeHash)

Physics: If you change 1 bit of storage in a smart contract:

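  1. The Storage Trie Root changes.
  2. The Account State changes, because the account value embeds the StorageRoot.
  3. The Global State Root changes, because it commits to every account.

This “Avalanche Effect” ensures integrity, but every node on the path from the modified leaf up to the root must be rehashed and rewritten, which costs heavy disk I/O.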
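The chain of root changes described above can be sketched with a toy model. This is only an illustration under loud assumptions: `h` uses SHA3-256 as a stand-in for Keccak-256, and the helper names (`storage_root`, `account_hash`, `state_root`) are hypothetical flat hashes rather than real RLP-encoded Merkle Patricia Tries.

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in hash: SHA3-256, NOT Ethereum's Keccak-256 (the padding differs).
    return hashlib.sha3_256(data).digest()

def storage_root(storage: dict) -> bytes:
    # Toy commitment over sorted (slot, value) pairs; a real node uses an MPT.
    encoded = b"".join(
        slot.to_bytes(32, "big") + value.to_bytes(32, "big")
        for slot, value in sorted(storage.items())
    )
    return h(encoded)

def account_hash(nonce: int, balance: int, storage: dict, code: bytes) -> bytes:
    # Toy stand-in for RLP(Nonce, Balance, StorageRoot, CodeHash).
    return h(
        nonce.to_bytes(8, "big")
        + balance.to_bytes(32, "big")
        + storage_root(storage)
        + h(code)
    )

def state_root(accounts: dict) -> bytes:
    # Toy global commitment over sorted (address, account hash) pairs.
    return h(b"".join(addr + acct for addr, acct in sorted(accounts.items())))

contract = bytes.fromhex("aa" * 20)  # hypothetical contract address
code = b"\x60\x00"                   # hypothetical bytecode
storage = {0: 1}

before = (
    storage_root(storage),
    account_hash(1, 0, storage, code),
    state_root({contract: account_hash(1, 0, storage, code)}),
)

storage[0] = 0  # flip a single bit in one storage slot

after = (
    storage_root(storage),
    account_hash(1, 0, storage, code),
    state_root({contract: account_hash(1, 0, storage, code)}),
)

for name, b, a in zip(("storage root", "account", "state root"), before, after):
    print(f"{name} changed: {b != a}")
```

All three levels report a change, even though only one bit of one slot was touched.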

Deep Dive: Full vs Archive (The Pruning)

Why is an Archive Node 15TB but a Full Node only 1TB? Pruning.

  • Archive Node: Stores the State Trie for every block from Genesis.
    • “What was Vitalik’s balance at Block 5,000,000?” -> Query Trie at Root $R_{5M}$.
  • Full Node: Stores only the Current State Trie (Head) and usually the last 128 blocks.
    • “What was Vitalic’s balance at Block 5,000,000?” -> Error: Missing Trie Node.

Physics: Both nodes verify every transaction. A Full Node just deletes the old data after verification to save space.
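The difference in retention can be sketched with a toy model. `ToyNode` and `RETAINED_BLOCKS` are hypothetical names, and the sketch copies a whole state snapshot per block for simplicity. Real clients instead share unchanged trie nodes between blocks and delete only the nodes that are no longer referenced.

```python
from collections import OrderedDict

RETAINED_BLOCKS = 128  # assumption: the "last 128 blocks" figure from the text

class ToyNode:
    """Keeps per-block state snapshots; prunes old ones unless archive=True."""

    def __init__(self, archive: bool):
        self.archive = archive
        self.states = OrderedDict()  # block number -> state snapshot

    def import_block(self, number: int, state: dict) -> None:
        # Both node types process every block; only retention differs.
        self.states[number] = dict(state)
        if not self.archive:
            while len(self.states) > RETAINED_BLOCKS:
                self.states.popitem(last=False)  # drop the oldest snapshot

    def balance_at(self, number: int, account: str) -> int:
        if number not in self.states:
            raise LookupError(f"missing state for block {number}")
        return self.states[number].get(account, 0)

archive_node = ToyNode(archive=True)
full_node = ToyNode(archive=False)

state = {"alice": 0}
for block in range(1, 1001):
    state["alice"] += 1  # one toy state change per block
    archive_node.import_block(block, state)
    full_node.import_block(block, state)

print(archive_node.balance_at(5, "alice"))   # historical query succeeds
print(full_node.balance_at(1000, "alice"))   # head query succeeds
try:
    full_node.balance_at(5, "alice")
except LookupError as err:
    print("full node:", err)
```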


Strategy: Light Clients (Trustless Verification)

How can a mobile phone verify a transaction without downloading 1TB? Merkle Proofs.

  1. Phone: “Hey Full Node, did Account A send 5 ETH?”
  2. Full Node: “Yes. Here is the transaction, and here is the ‘Branch’ of hashes leading to the Block Header.”
  3. Phone: Hashes the branch. If the result matches the Block Header (which the phone trusts via PoW/PoS), the data must be true.

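Physics: The Light Client downloads only block headers (~10MB here). A valid proof gives cryptographic assurance of inclusion, as strong as the hash function's collision resistance, without the client storing the underlying data.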


Architecture: Geth vs Reth

  • Geth (Go): Uses LevelDB/Pebble. Single-threaded execution (historically). Heavy on RAM.
  • Reth (Rust): Uses MDBX. Staged Sync (a pipeline design pioneered by Erigon). Massive parallelism in I/O.
    • Result: Reth syncs an Archive Node in 50 hours (vs weeks for Geth).

Code: Verifying a Merkle Proof

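import hashlib

def keccak256(data: bytes) -> bytes:
    # Stand-in only: hashlib.sha3_256 is NIST SHA3-256, which pads differently
    # from Ethereum's Keccak-256 and produces different digests. Swap in a real
    # Keccak-256 implementation before checking actual Ethereum data.
    return hashlib.sha3_256(data).digest()

def verify_proof(root_hash: bytes, target_data: bytes, proof_path) -> bool:
    """Verify a binary Merkle proof (a simplification of a real MPT proof).

    proof_path is a list of (sibling_hash, direction) pairs ordered from the
    leaf up to the root; direction names the side the sibling sits on.
    """
    current_hash = keccak256(target_data)

    for sibling_hash, direction in proof_path:
        if direction == 'left':
            # Sibling is the left child: Hash(Sibling + Current)
            current_hash = keccak256(sibling_hash + current_hash)
        else:
            # Sibling is the right child: Hash(Current + Sibling)
            current_hash = keccak256(current_hash + sibling_hash)

    # The proof is valid only if the recomputed root matches the trusted one.
    return current_hash == root_hash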
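
To see the verifier in action, it helps to have the other half: the code a Full Node would run to produce the branch. The following is a minimal sketch, assuming a plain binary Merkle tree that duplicates the last node on odd-sized levels. `build_proof` and `h` are hypothetical helpers, SHA3-256 again stands in for Keccak-256, and the verifier is repeated so that the block runs on its own.

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in hash: SHA3-256, not Ethereum's Keccak-256.
    return hashlib.sha3_256(data).digest()

def build_proof(leaves, index):
    """Return (root, proof) for leaves[index] in a binary Merkle tree."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        sibling = index ^ 1          # the neighbour that shares our parent
        direction = "left" if sibling < index else "right"
        proof.append((level[sibling], direction))
        # Hash adjacent pairs to build the next level up.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof

def verify_proof(root_hash, target_data, proof_path):
    current = h(target_data)
    for sibling_hash, direction in proof_path:
        if direction == "left":
            current = h(sibling_hash + current)
        else:
            current = h(current + sibling_hash)
    return current == root_hash

transactions = [b"tx0", b"tx1", b"tx2", b"tx3", b"tx4"]
root, proof = build_proof(transactions, 2)

print(verify_proof(root, b"tx2", proof))     # genuine transaction
print(verify_proof(root, b"forged", proof))  # tampered transaction
print(len(proof), "hashes in the branch")
```

The branch holds one hash per tree level, so its length grows logarithmically with the number of leaves.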

Practice Exercises

Exercise 1: State Growth (Beginner)

Scenario: 1 Million new accounts created per day. Each account is 100 bytes. Task: Calculate monthly state growth. ($1\text{M} \times 30 \times 100\,\text{B} = 3\,\text{GB}$.) But overhead (Trie Nodes) makes it $10\times$ larger.
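
The arithmetic can be checked with a few lines of Python. The constant names are hypothetical, and the 10x overhead factor is the exercise's rough multiplier, not a measured value.

```python
ACCOUNTS_PER_DAY = 1_000_000
BYTES_PER_ACCOUNT = 100
DAYS_PER_MONTH = 30
TRIE_OVERHEAD = 10  # rough multiplier from the exercise, not a measurement

raw_bytes = ACCOUNTS_PER_DAY * BYTES_PER_ACCOUNT * DAYS_PER_MONTH
raw_gb = raw_bytes / 1e9                  # decimal gigabytes
with_overhead_gb = raw_gb * TRIE_OVERHEAD

print(f"raw account data:   {raw_gb:.1f} GB/month")            # 3.0 GB/month
print(f"with trie overhead: {with_overhead_gb:.1f} GB/month")  # 30.0 GB/month
```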

Exercise 2: Pruning (Intermediate)

Scenario: You run a node with --gcmode=full. Task: You try to query eth_call on a block from last year. Result: “Error: computations for old blocks are not implemented.” (The state is gone).
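
For reference, a historical query differs from a head query only in its block parameter. The sketch below builds (but does not send) an `eth_call` JSON-RPC request pinned to an old block. The helper name, the all-zero contract address, the empty calldata, and the block number are placeholder assumptions for illustration.

```python
import json

def historical_call_payload(to: str, data: str, block_number: int) -> dict:
    """Build an eth_call JSON-RPC request pinned to a specific block."""
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_call",
        # Second param is the block: a hex quantity instead of "latest".
        "params": [{"to": to, "data": data}, hex(block_number)],
    }

CONTRACT = "0x" + "00" * 20  # hypothetical contract address
CALLDATA = "0x"              # hypothetical (empty) calldata

payload = historical_call_payload(CONTRACT, CALLDATA, 5_000_000)
print(json.dumps(payload, indent=2))
```

A pruned node can answer this request only if the state for that block is still on disk, whereas an Archive Node can answer it for any block.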

Exercise 3: Sync Time (Advanced)

Scenario: SSD IOPS = 10,000. Total Trie Nodes = 1 Billion. Task: Estimate Sync Time. (Requires random reads, one per trie node with no caching: 1B / 10k = 100,000 seconds = ~28 hours).
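
The estimate is a single division, which makes it easy to see how strongly sync time depends on the disk. This sketch assumes one random read per trie node and no caching or batching. `sync_seconds` is a hypothetical helper, and the 500,000 IOPS figure is an illustrative value for a faster drive, not a benchmark.

```python
def sync_seconds(trie_nodes: int, iops: int) -> float:
    # Assumes one random read per trie node, with no caching or batching.
    return trie_nodes / iops

TRIE_NODES = 1_000_000_000

seconds = sync_seconds(TRIE_NODES, 10_000)
hours = seconds / 3600
print(f"{seconds:,.0f} s = {hours:.1f} hours")

# Illustrative comparison with a hypothetical 500,000 IOPS drive.
fast_minutes = sync_seconds(TRIE_NODES, 500_000) / 60
print(f"{fast_minutes:.1f} minutes")
```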


Knowledge Check

  1. What is the difference between a Full Node and an Archive Node?
  2. What does a Merkle Root represent?
  3. Why do Light Clients trust Block Headers?
  4. Why is I/O the bottleneck for nodes?
  5. What is “State Pruning”?

Answers

  1. History. Archive keeps all historical states; Full keeps only the tip.
  2. Summary. A cryptographic fingerprint of the entire dataset.
  3. PoS/PoW. The cost to fake a header is billions of dollars (Consensus).
  4. Random Access. Updating the Trie requires jumping around the disk (random writes).
  5. Deletion. Removing old Merkle Nodes that are no longer referenced by the HEAD block.

Summary

  • State: Stored in Tries.
  • Archive: Stores all Tries.
  • Light: Verifies Roots.
