Observability: The Physics of Seeing
Why you can't debug what you can't see. Metrics, Logs, Traces, and the Heisenberg Uncertainty Principle of monitoring.
🎯 What You'll Learn
- Differentiate Logs (Cardinality) vs Metrics (Aggregatable)
- Deconstruct the 'Observer Effect' in tracing overhead
- Design Histogram Buckets for Latency (SLO tracking)
- Trace a Distributed Request via SpanContext
- Analyze the cost of High Cardinality
📚 Prerequisites
Before this lesson, you should understand:
Introduction
In a monolith, debugging is tail -f /var/log/syslog.
In a distributed system, that log file doesn’t exist.
Your request hit 50 microservices. One of them failed. Which one? If you don’t have Observability, you are flying blind in a storm.
Observability is not “Monitoring” (Checking if the server is up). Observability is asking arbitrary questions about your system without shipping new code.
The Physics: Sampling & Overhead
The Heisenberg Uncertainty Principle applies to systems: Measuring the system changes the system.
- Logging: Writes to disk (I/O blocking).
- Tracing: Adds headers and network calls (Latency).
- Sidecars: Consume CPU/Memory (Resource contention).
The Solution: Sampling. You don’t trace 100% of requests. You trace a random 0.1% (Head Sampling), or keep only the “interesting” requests, such as errors and slow outliers (Tail Sampling).
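As a rough sketch of the difference (the sample rate and thresholds below are illustrative, not from any specific tracing SDK): head sampling decides before the work happens, tail sampling after.

# Head vs. tail sampling as decision functions (illustrative rate and thresholds)
import random

HEAD_SAMPLE_RATE = 0.001  # keep 0.1% of traces

def head_sample():
    # Decided before the request runs: purely random, knows nothing about the outcome
    return random.random() < HEAD_SAMPLE_RATE

def tail_sample(duration_ms, status_code):
    # Decided after the request finishes: keep only the "interesting" ones
    return status_code >= 500 or duration_ms > 1000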
The Three Pillars (And When They Break)
1. Metrics (The Dashboard)
- Physics: Aggregatable numbers. Counter, Gauge, Histogram.
- Superpower: Cheap. Storing “1 Million Requests” takes the same space as “1 Request”.
- Kryptonite: High Cardinality.
- Good Label: status="200" (Low cardinality).
- Bad Label: user_id="847382" (High cardinality).
- Result: Your Prometheus server crashes trying to index 1 million time series.
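Why does the bad label blow up? The backend stores one time series per unique combination of label values, so cardinalities multiply. A back-of-the-envelope sketch with illustrative numbers:

# One time series per unique label combination, so cardinalities multiply.
# All numbers below are illustrative.
methods   = 5            # GET, POST, PUT, DELETE, PATCH
statuses  = 10           # 200, 201, 204, 301, 400, 401, 403, 404, 500, 503
endpoints = 50

low_cardinality = methods * statuses * endpoints
print(low_cardinality)                      # 2,500 series: easy to index

unique_user_ids = 1_000_000                 # now add a user_id label...
high_cardinality = low_cardinality * unique_user_ids
print(high_cardinality)                     # 2,500,000,000 series: the index dies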
2. Logs (The Truth)
- Physics: High-fidelity event records.
- Superpower: Infinite detail. “User 5 bought Item 9”.
- Kryptonite: Cost. Indexing 1TB of logs/day is expensive.
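That “User 5 bought Item 9” level of detail is exactly what structured logs capture. A minimal sketch using only the standard library (the field names are illustrative):

# Minimal structured (JSON) logging sketch; field names are illustrative.
import json, logging, time

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event, **fields):
    # One JSON object per line: indexable, greppable, and correlatable by trace_id
    logging.info(json.dumps({"ts": time.time(), "event": event, **fields}))

log_event("purchase", user_id=5, item_id=9, amount_cents=1299, trace_id="a1b2c3")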
3. Traces (The Story)
- Physics: Causal chains. ParentID -> SpanID.
- Superpower: Finding the bottleneck. “Why did this take 2 seconds? Oh, the Redis Cache miss took 1.9s.”
- Kryptonite: Broken Context. If one middleware drops the headers, the trace breaks.
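To make that concrete, here is a minimal sketch of how a SpanContext travels between services via the W3C traceparent header (the helper functions are illustrative; production systems use an OpenTelemetry SDK for this):

# Minimal SpanContext propagation sketch using the W3C `traceparent` header.
# Helper names are illustrative; real code would use an OpenTelemetry SDK.
import secrets

def new_span_context():
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

def inject(ctx, headers):
    # traceparent format: version-trace_id-parent_span_id-flags
    headers["traceparent"] = f"00-{ctx['trace_id']}-{ctx['span_id']}-01"

def extract(headers):
    # If any middleware drops this header, the chain breaks and the trace splits.
    _version, trace_id, parent_span_id, _flags = headers["traceparent"].split("-")
    return {"trace_id": trace_id, "parent_id": parent_span_id,
            "span_id": secrets.token_hex(8)}

outgoing_headers = {}
inject(new_span_context(), outgoing_headers)   # service A, before the HTTP call
child_ctx = extract(outgoing_headers)          # service B continues the same trace_id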
Code: The Histogram
The most misunderstood metric type. How do you calculate “99th Percentile Latency” across 100 servers? You can’t average averages. You must use Buckets.
# Prometheus conceptual implementation
class Histogram:
    def __init__(self):
        # Buckets define the resolution (cumulative, like Prometheus "le" buckets)
        self.buckets = {
            0.1: 0,            # <= 100ms
            0.5: 0,            # <= 500ms
            1.0: 0,            # <= 1s
            float("inf"): 0,   # everything else
        }
        self.sum = 0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # Increment every bucket whose upper bound covers this value
        for bound in self.buckets:
            if value <= bound:
                self.buckets[bound] += 1

# When querying:
# "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
# It mathematically interpolates the buckets to estimate the value.
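To build intuition for what histogram_quantile does with those buckets, here is a simplified linear-interpolation sketch (not Prometheus’s exact algorithm, and it assumes the Histogram class above):

# Simplified quantile estimation from cumulative buckets
# (linear interpolation; not Prometheus's exact algorithm)
def estimate_quantile(hist, q):
    if hist.count == 0:
        return None
    target = q * hist.count                        # rank of the q-th quantile observation
    prev_bound, prev_cum = 0.0, 0
    for bound, cum in sorted(hist.buckets.items()):
        if cum >= target:
            if bound == float("inf"):
                return prev_bound                  # no finite upper edge to interpolate into
            # Linearly interpolate inside the bucket that crosses the target rank
            fraction = (target - prev_cum) / (cum - prev_cum)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_cum = bound, cum

h = Histogram()
for latency in [0.05, 0.07, 0.2, 0.3, 0.9, 1.5]:
    h.observe(latency)
print(estimate_quantile(h, 0.5))                   # ~0.3, interpolated inside the 0.1-0.5 bucket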
Deep Dive: Push (Telegraf) vs Pull (Prometheus)
- Push: Agent sends data to central server.
- Pros: Good for short-lived jobs (Lambda), bypasses firewalls.
- Cons: Can DDoS the server if many agents push at once.
- Pull: Server scrapes targets.
- Pros: Server controls the load. Explicit inventory.
- Cons: Need to discover targets (Service Discovery).
Industry Standard: Pull (Prometheus) is the winner for Kubernetes.
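A minimal sketch of the pull side: the application exposes a /metrics endpoint in Prometheus’s text format and the server scrapes it on its own schedule (a toy using the standard library, not the official prometheus_client):

# Toy pull-model target: expose Prometheus text format on /metrics.
# The real prometheus_client library handles this (and much more) for you.
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = 0  # would be incremented by real request handlers

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = f'http_requests_total{{status="200"}} {REQUESTS_TOTAL}\n'.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# Prometheus is configured to scrape http://app:8000/metrics; the server controls the pace.
# HTTPServer(("", 8000), MetricsHandler).serve_forever()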
Practice Exercises
Exercise 1: Cardinality Explosion (Beginner)
Scenario: You add a label client_ip to your http_requests_total metric.
Task: Why does your monitoring bill suddenly increase by 100x? (Hint: How many unique IPs hit your site?).
Exercise 2: Sampling Rates (Intermediate)
Scenario: 10,000 Req/Sec. Tracing adds 1ms of overhead per request.
Task: If you sample 100% of requests, how much total CPU time is wasted per second on tracing? What is a safe sampling rate?
Exercise 3: SLO Calculation (Advanced)
Scenario: SLO: “99% of requests < 200ms”.
Task: Write a PromQL query using http_request_duration_bucket to check if this SLO is being met.
Knowledge Check
- Why can’t you calculate P99 from avg_latency metrics?
- What is a “Span Context”?
- Why is logging “User ID” in a Metric label dangerous?
- How does Tail Sampling differ from Head Sampling?
- Which pillar is best for debugging a specific customer complaint?
Answers
- Math. Averages hide outliers. P99 requires distribution data (Histograms).
- Metadata. TraceID and ParentSpanID passed in HTTP Headers to Correlate services.
- High Cardinality. It creates infinite time series, crashing the database.
- Decision Timing. Head decides before the request starts (random). Tail decides after it ends (keep only errors/slow ones).
- Logs. (Or Traces if sampled). Metrics won’t show you the specific user’s error message.
Summary
- You get what you pay for. Logs are expensive but detailed. Metrics are cheap but vague.
- Cardinality kills. Watch your labels.
- Context is King. A trace without context is just noise.