Observability: The Physics of Seeing

Why you can't debug what you can't see. Metrics, Logs, Traces, and the Heisenberg Uncertainty Principle of monitoring.

Beginner · 35 min read

🎯 What You'll Learn

  • Differentiate Logs (Cardinality) vs Metrics (Aggregatable)
  • Deconstruct the 'Observer Effect' in tracing overhead
  • Design Histogram Buckets for Latency (SLO tracking)
  • Trace a Distributed Request via SpanContext
  • Analyze the cost of High Cardinality


Introduction

In a monolith, debugging is tail -f /var/log/syslog. In a distributed system, that log file doesn’t exist.

Your request hit 50 microservices. One of them failed. Which one? If you don’t have Observability, you are flying blind in a storm.

Observability is not “Monitoring” (checking whether the server is up, answering questions you anticipated in advance). Observability is the ability to ask arbitrary questions about your system without shipping new code.


The Physics: Sampling & Overhead

The observability version of the Heisenberg Uncertainty Principle (more precisely, the observer effect) applies to systems: measuring the system changes the system.

  • Logging: Writes to disk (I/O blocking).
  • Tracing: Adds headers and network calls (Latency).
  • Sidecars: Consume CPU/Memory (Resource contention).

The Solution: Sampling. You don’t trace 100% of requests. You either trace a fixed fraction, say 0.1%, chosen before the request runs (Head Sampling), or you keep only the “interesting” requests, such as errors and slow outliers, after they finish (Tail Sampling).
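
To make the two strategies concrete, here is a minimal sketch of how the sampling decision differs. It is illustrative only and not tied to any particular tracing library; the thresholds are assumptions.

import random

SAMPLE_RATE = 0.001  # 0.1%

def head_sample() -> bool:
    # Decided before the request runs: cheap, but blind to the outcome.
    return random.random() < SAMPLE_RATE

def tail_sample(status_code: int, duration_ms: float) -> bool:
    # Decided after the request finishes: keep only the "interesting" ones.
    # The price: every trace must be buffered until the verdict is in.
    return status_code >= 500 or duration_ms > 1000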


The Three Pillars (And When They Break)

1. Metrics (The Dashboard)

  • Physics: Aggregatable numbers. Counter, Gauge, Histogram.
  • Superpower: Cheap. A counter of “1 Million Requests” takes the same space as a counter of “1 Request”.
  • Kryptonite: High Cardinality.
    • Good Label: status="200" (a handful of possible values).
    • Bad Label: user_id="847382" (one new time series per user).
    • Result: Your Prometheus server falls over trying to index millions of time series. See the sketch below.
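
To make the cardinality point concrete, here is a minimal sketch assuming the Python prometheus_client library (the metric names are illustrative):

from prometheus_client import Counter

# Low cardinality: a handful of status codes -> a handful of time series.
REQUESTS = Counter("http_requests_total", "HTTP requests", ["status"])
REQUESTS.labels(status="200").inc()

# High cardinality (the anti-pattern): one time series per user.
# Left commented out on purpose -- every distinct user_id value would
# create and index a brand-new series in the time-series database.
# BAD = Counter("http_requests_by_user_total", "Per-user requests", ["user_id"])
# BAD.labels(user_id="847382").inc()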

2. Logs (The Truth)

  • Physics: High-fidelity event records.
  • Superpower: Infinite detail. “User 5 bought Item 9”.
  • Kryptonite: Cost. Indexing 1TB of logs/day is expensive.
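
As a quick sketch of why logs carry detail that metrics cannot: a structured (JSON) log line keeps every field queryable. This uses only the standard library; the field names are illustrative.

import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")

def log_event(**fields):
    # One JSON object per line: easy to ship, index, and filter later.
    logging.info(json.dumps(fields))

log_event(event="purchase", user_id=5, item_id=9, amount_cents=1299)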

3. Traces (The Story)

  • Physics: Causal chains. Each span records a TraceID, its own SpanID, and its parent’s SpanID.
  • Superpower: Finding the bottleneck. “Why did this take 2 seconds? Oh, the Redis Cache miss took 1.9s.”
  • Kryptonite: Broken Context. If one middleware drops the headers, the trace breaks.
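
A hand-rolled sketch of what “passing the context” means. Real systems use a standard header such as W3C traceparent, but the mechanics are the same; the header names below are illustrative assumptions.

import secrets

def start_trace() -> dict:
    # The edge service mints a new trace.
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

def child_context(parent: dict) -> dict:
    # Every downstream hop keeps the trace_id, records the caller's span as
    # its parent, and mints its own span_id. Drop these fields anywhere in
    # the chain and the trace fragments.
    return {
        "trace_id": parent["trace_id"],
        "parent_span_id": parent["span_id"],
        "span_id": secrets.token_hex(8),
    }

ctx = start_trace()
outgoing_headers = {"X-Trace-Id": ctx["trace_id"], "X-Span-Id": ctx["span_id"]}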

Code: The Histogram

The most misunderstood metric type. How do you calculate “99th Percentile Latency” across 100 servers? You can’t average per-server averages or percentiles; you have to merge the underlying distributions, and that means Buckets.

# Prometheus-style histogram (conceptual implementation)
class Histogram:
    def __init__(self):
        # Cumulative "le" (less-than-or-equal) buckets define the resolution
        self.buckets = {
            0.1: 0,            # <= 100ms
            0.5: 0,            # <= 500ms
            1.0: 0,            # <= 1s
            float("inf"): 0,   # everything else
        }
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # One observation increments every bucket whose bound it fits under
        for bound in self.buckets:
            if value <= bound:
                self.buckets[bound] += 1

# When querying:
# "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
# interpolates within the buckets to estimate the 99th-percentile value.
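
Here is, roughly, what that interpolation does: walk the cumulative buckets until the target rank is reached, then interpolate linearly inside that bucket. This is a simplified sketch of the idea, not Prometheus’s exact algorithm.

def estimate_quantile(q, buckets, count):
    # buckets: {upper_bound: cumulative_count}, as produced by Histogram above.
    # Assumes 0 < q <= 1 and at least one observation.
    target = q * count
    prev_bound, prev_cum = 0.0, 0
    for bound in sorted(buckets):
        cum = buckets[bound]
        if cum >= target:
            if bound == float("inf") or cum == prev_cum:
                return prev_bound  # cannot interpolate past the last finite bound
            # Linear interpolation within the bucket
            fraction = (target - prev_cum) / (cum - prev_cum)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_cum = bound, cum
    return prev_bound

# Example:
# h = Histogram()
# for latency in (0.05, 0.2, 0.3, 0.7, 1.4): h.observe(latency)
# estimate_quantile(0.99, h.buckets, h.count)  # -> 1.0 (top of the finite buckets)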

Deep Dive: Push (Telegraf) vs Pull (Prometheus)

  • Push: Agent sends data to central server.
    • Pros: Good for short-lived jobs (Lambda), bypasses firewalls.
    • Cons: Can DDoS the server if many agents push at once.
  • Pull: Server scrapes targets.
    • Pros: Server controls the load. Explicit inventory.
    • Cons: Need to discover targets (Service Discovery).

Industry Standard: Pull (Prometheus) is the winner for Kubernetes.
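
A minimal sketch of the pull model, assuming the Python prometheus_client library: the process exposes a /metrics endpoint and the Prometheus server scrapes it on its own schedule. The port and metric name are illustrative.

import random
import time

from prometheus_client import Counter, start_http_server

REQUESTS = Counter("demo_requests_total", "Handled requests", ["status"])

if __name__ == "__main__":
    start_http_server(8000)  # exposes http://localhost:8000/metrics for scraping
    while True:  # simulate traffic; Prometheus pulls at its own pace
        REQUESTS.labels(status=random.choice(["200", "500"])).inc()
        time.sleep(0.1)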


Practice Exercises

Exercise 1: Cardinality Explosion (Beginner)

Scenario: You add a label client_ip to your http_requests_total metric. Task: Why does your monitoring bill suddenly increase by 100x? (Hint: How many unique IPs hit your site?).

Exercise 2: Sampling Rates (Intermediate)

Scenario: 10,000 Req/Sec. Tracing adds 1ms overhead per request. Task: If you sample 100% of requests, how much total CPU time is wasted per second on tracing? What is a safe sampling rate?

Exercise 3: SLO Calculation (Advanced)

Scenario: SLO: “99% of requests < 200ms”. Task: Write a PromQL query using http_request_duration_seconds_bucket to check whether this SLO is being met.


Knowledge Check

  1. Why can’t you calculate P99 from avg_latency metrics?
  2. What is a “Span Context”?
  3. Why is logging “User ID” in a Metric label dangerous?
  4. How does Tail Sampling differ from Head Sampling?
  5. Which pillar is best for debugging a specific customer complaint?
Answers
  1. Math. Averages hide outliers. P99 requires distribution data (Histograms).
  2. Metadata. The TraceID and ParentSpanID passed in HTTP headers so downstream services can correlate their spans into a single trace.
  3. High Cardinality. It creates infinite time series, crashing the database.
  4. Decision Timing. Head decides before the request starts (random). Tail decides after it ends (keep only errors/slow ones).
  5. Logs. (Or Traces if sampled). Metrics won’t show you the specific user’s error message.

Summary

  • You get what you pay for. Logs are expensive but detailed. Metrics are cheap but vague.
  • Cardinality kills. Watch your labels.
  • Context is King. A trace without context is just noise.
