Infrastructure

Trading Metrics: What SRE Dashboards Miss

Fill latency, position drift, market data staleness. The SLOs that prevent losses, not just track uptime. Prometheus, Grafana, and alerting patterns.

7 min
#monitoring #observability #trading #sre #prometheus #grafana #infrastructure

Your trading system has 99.99% uptime. Congratulations. You’re measuring the wrong thing.

I’ve seen systems with perfect infrastructure dashboards lose $50K in a day. CPU was at 5%. Memory was fine. All services green. But fill latency had degraded from 50ms to 500ms, and nobody noticed until the PnL reconciliation.

This post covers the metrics that actually matter for trading, how to instrument them, and how to alert before money is lost.

The Problem {#the-problem}

Standard SRE metrics:

  • CPU usage
  • Memory utilization
  • Disk I/O
  • Network throughput
  • Service uptime

These are necessary but not sufficient. A server can have 5% CPU while:

  • Fills are taking 10x longer than expected
  • Positions have drifted from exchange state
  • Market data is 5 seconds stale
  • Rate limits are exhausted

The cost: By the time infrastructure metrics alert, you’ve already lost money.

For the infrastructure these metrics run on, see the related posts in the Reading Path at the end.

Fill Latency {#fill-latency}

What It Measures

Time from order submission to fill confirmation.

Why it matters: Fill latency directly impacts execution quality. A 500ms delay on a $1M order costs $500+ in slippage during volatile markets.
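
As a quick sanity check on that number, assuming a 5 bps adverse price move while the order is in flight (an illustrative figure, not a measured one):

# Rough slippage estimate for the delay described above
notional = 1_000_000               # order size in USD
adverse_move_bps = 5               # assumed adverse drift during the extra 500ms
slippage = notional * adverse_move_bps / 10_000
print(f"${slippage:,.0f}")         # $500 -- the figure quoted above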

Implementation

from prometheus_client import Histogram
import time

FILL_LATENCY = Histogram(
    'trading_fill_latency_seconds',
    'Time from order submission to fill confirmation',
    ['exchange', 'symbol'],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10]
)

class OrderManager:
    def __init__(self, exchange):
        self.exchange = exchange          # exchange client used to submit orders
        self.pending_orders = {}          # order_id -> order awaiting a fill

    def submit_order(self, order):
        order.submitted_at = time.time()  # stamp before handing off to the exchange
        self.pending_orders[order.id] = order
        self.exchange.submit(order)
    
    def on_fill(self, fill):
        order = self.pending_orders.pop(fill.order_id, None)
        if order:
            latency = time.time() - order.submitted_at
            FILL_LATENCY.labels(
                exchange=order.exchange,
                symbol=order.symbol
            ).observe(latency)

Prometheus Recording Rules

groups:
- name: trading_slos
  rules:
  - record: trading:fill_latency:p50_5m
    expr: histogram_quantile(0.50, sum(rate(trading_fill_latency_seconds_bucket[5m])) by (le, exchange))
  
  - record: trading:fill_latency:p95_5m
    expr: histogram_quantile(0.95, sum(rate(trading_fill_latency_seconds_bucket[5m])) by (le, exchange))
  
  - record: trading:fill_latency:p99_5m
    expr: histogram_quantile(0.99, sum(rate(trading_fill_latency_seconds_bucket[5m])) by (le, exchange))

Thresholds

| Environment | P50 | P95 | P99 |
|---|---|---|---|
| Cloud (acceptable) | <100ms | <500ms | <2s |
| Cloud (good) | <50ms | <200ms | <500ms |
| Colocated | <1ms | <5ms | <20ms |
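
To act on these numbers, one option is to alert on the p99 recording rule. A minimal sketch using the cloud "acceptable" ceiling from the table above (the alert name and thresholds are illustrative; tune them to your own targets):

- alert: FillLatencyDegraded
  expr: trading:fill_latency:p99_5m > 2        # cloud "acceptable" p99 ceiling
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Fill latency p99 above 2s on {{ $labels.exchange }}"
    description: "p99 fill latency is {{ $value }}s over the last 5 minutes"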

Position Drift {#position-drift}

What It Measures

Difference between your calculated position and exchange’s recorded position.

Why it matters: Drift means something upstream has gone wrong:

  • Missed fill messages
  • Network packet loss
  • Race conditions in reconciliation

Whatever the cause, your risk calculations are wrong until the drift is resolved.

Implementation

from prometheus_client import Gauge
import asyncio

POSITION_DRIFT = Gauge(
    'trading_position_drift_percent',
    'Difference between calculated and exchange position',
    ['exchange', 'symbol']
)

POSITION_DRIFT_ABS = Gauge(
    'trading_position_drift_absolute',
    'Absolute difference in position',
    ['exchange', 'symbol']
)

async def reconcile_positions():
    while True:
        for symbol in trading_symbols:
            calculated = database.get_position(symbol)
            exchange_position = await client.get_position(symbol)

            if calculated != 0:
                drift_pct = abs(calculated - exchange_position) / abs(calculated)
            else:
                # Any position when expecting 0 counts as 100% drift
                drift_pct = 1.0 if exchange_position != 0 else 0.0

            drift_abs = abs(calculated - exchange_position)

            POSITION_DRIFT.labels(
                exchange='binance',
                symbol=symbol
            ).set(drift_pct)

            POSITION_DRIFT_ABS.labels(
                exchange='binance',
                symbol=symbol
            ).set(drift_abs)

        await asyncio.sleep(60)  # Reconcile every minute
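
One way to run this is as a background task next to the trading loop. A sketch assuming an asyncio-based application (run_trading is a placeholder for your own entry point):

import asyncio

async def main():
    # Reconcile in the background for the lifetime of the app
    reconciler = asyncio.create_task(reconcile_positions())
    try:
        await run_trading()   # placeholder for your main trading loop
    finally:
        reconciler.cancel()   # stop reconciling on shutdown

asyncio.run(main())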

Alert Rules

- alert: PositionDrift
  expr: trading_position_drift_percent > 0.01  # 1%
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Position drift > 1% on {{ $labels.symbol }}"
    description: "Calculated and exchange positions differ by {{ $value | humanizePercentage }}"

Thresholds

  • Normal: drift = 0%
  • Warning: drift > 0.5%
  • Page immediately: drift > 1%

Market Data Staleness {#staleness}

What It Measures

Time since last orderbook update.

Why it matters: Trading on stale data = trading blind. Common causes:

  • WebSocket silently disconnected
  • Exchange rate limiting
  • Network congestion
  • Parser crash

Implementation

from prometheus_client import Gauge
import asyncio
import time

MARKET_DATA_AGE = Gauge(
    'trading_market_data_age_seconds',
    'Time since last orderbook update',
    ['exchange', 'symbol']
)

last_update = {}

def on_orderbook_update(exchange: str, symbol: str, data: dict):
    last_update[(exchange, symbol)] = time.time()
    # Process orderbook...

async def staleness_monitor():
    while True:
        now = time.time()
        for (exchange, symbol), ts in last_update.items():
            age = now - ts
            MARKET_DATA_AGE.labels(
                exchange=exchange,
                symbol=symbol
            ).set(age)
        await asyncio.sleep(1)

Alert Rules

- alert: StaleMarketData
  expr: trading_market_data_age_seconds > 5
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "No market data for 5s on {{ $labels.symbol }}"
    runbook: "Check WebSocket connection, reconnect if needed"

Order Rejection Rate {#rejections}

What It Measures

Percentage of orders rejected by exchange.

Why it matters: A high rejection rate usually indicates:

  • Insufficient margin
  • Invalid order sizes
  • Rate limits exceeded
  • Exchange maintenance
  • Bugs in order construction

Implementation

from prometheus_client import Counter

ORDERS_SUBMITTED = Counter(
    'trading_orders_submitted_total',
    'Total orders submitted',
    ['exchange', 'type']
)

ORDERS_REJECTED = Counter(
    'trading_orders_rejected_total',
    'Total orders rejected',
    ['exchange', 'reason']
)

ORDERS_FILLED = Counter(
    'trading_orders_filled_total',
    'Total orders filled',
    ['exchange']
)

def submit_order(order):
    ORDERS_SUBMITTED.labels(
        exchange=order.exchange,
        type=order.type
    ).inc()
    # Submit...

def on_rejection(order, reason):
    ORDERS_REJECTED.labels(
        exchange=order.exchange,
        reason=reason
    ).inc()

def on_fill(fill):
    ORDERS_FILLED.labels(exchange=fill.exchange).inc()

Recording Rules

- record: trading:order_rejection_rate:5m
  expr: |
    sum(rate(trading_orders_rejected_total[5m])) by (exchange)
    /
    sum(rate(trading_orders_submitted_total[5m])) by (exchange)

Thresholds

  • Normal: <1%
  • Warning: >3%
  • Critical: >5%
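
An alert keyed to the recorded rate and the critical threshold above might look like this (a sketch; the alert name is illustrative):

- alert: HighOrderRejectionRate
  expr: trading:order_rejection_rate:5m > 0.05   # critical threshold: >5%
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Order rejection rate above 5% on {{ $labels.exchange }}"
    description: "{{ $value | humanizePercentage }} of orders rejected over the last 5 minutes"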

Rate Limit Headroom {#rate-limits}

What It Measures

How close you are to exchange rate limits.

Why it matters: Hit rate limits = orders rejected = missed opportunities.

Implementation

from prometheus_client import Gauge
import time

RATE_LIMIT_USED = Gauge(
    'trading_rate_limit_used_percent',
    'Percentage of rate limit consumed',
    ['exchange', 'endpoint']
)

class RateLimitedClient:
    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self.requests = []
    
    def request(self, endpoint: str):
        now = time.time()
        self.requests = [t for t in self.requests if now - t < 60]
        
        used_pct = len(self.requests) / self.limit
        RATE_LIMIT_USED.labels(
            exchange='binance',
            endpoint=endpoint
        ).set(used_pct * 100)
        
        self.requests.append(now)
        # Make request...

Thresholds

  • Normal: <50%
  • Warning: >70%
  • Critical: >90%
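
Beyond alerting, the client itself can back off as it approaches the limit. A sketch built on the RateLimitedClient above; the 90% cutoff mirrors the critical threshold, and waiting for the oldest request to age out of the window is one policy among several:

import asyncio
import time

async def throttled_request(client: RateLimitedClient, endpoint: str):
    # Drop timestamps that have aged out of the rolling 60s window
    now = time.time()
    client.requests = [t for t in client.requests if now - t < 60]

    if len(client.requests) >= 0.9 * client.limit:
        # Over 90% of the budget: wait until the oldest request expires
        # rather than risk a rejection at the exchange
        await asyncio.sleep(60 - (now - client.requests[0]))

    return client.request(endpoint)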

Alerting Strategy {#alerting}

Alert Hierarchy

# AlertManager config
route:
  receiver: 'slack-trading'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-trading'
    continue: true
  - match:
      severity: warning
    receiver: 'slack-trading'

receivers:
- name: 'pagerduty-trading'
  pagerduty_configs:
  - service_key: '{{ PAGERDUTY_KEY }}'
    severity: critical

- name: 'slack-trading'
  slack_configs:
  - channel: '#trading-alerts'
    send_resolved: true

What Wakes You Up

PAGE IMMEDIATELY (2am wake-up):
├── Position drift > 1%
├── No fills in 5 minutes (market hours)
├── Market data stale > 30 seconds
├── Loss limit exceeded
└── WebSocket down all exchanges

SLACK (check in morning):
├── Fill latency P99 > 1s
├── Rejection rate > 3%
├── Rate limit > 70%
└── High reconnection rate
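
The "no fills in 5 minutes" page deserves its own rule, because it catches silent failures that every other metric misses. A sketch (market-hours gating is omitted here; add it via a recording rule or inhibition if you trade a session-based market):

- alert: NoFills
  expr: sum(rate(trading_orders_filled_total[5m])) == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No fills across all exchanges in the last 5 minutes"
    runbook: "Check order flow end-to-end: strategy, order manager, exchange connectivity"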

Runbook Pattern

annotations:
  summary: "Position drift detected"
  description: "{{ $labels.symbol }}: {{ $value | humanizePercentage }} drift"
  runbook: |
    1. Check /debug/positions endpoint for details
    2. Compare with exchange API positions
    3. If exchange is source of truth, force reconciliation
    4. If our system is wrong, investigate fill processing
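
The /debug/positions endpoint referenced in step 1 can be a thin read-only view over the same data the reconciler uses. A sketch assuming aiohttp and the database/client objects from the reconciliation example (both choices are illustrative):

from aiohttp import web

async def debug_positions(request: web.Request) -> web.Response:
    # Side-by-side view of calculated vs exchange positions for runbook step 1
    report = {}
    for symbol in trading_symbols:
        calculated = database.get_position(symbol)
        exchange_position = await client.get_position(symbol)
        report[symbol] = {
            "calculated": calculated,
            "exchange": exchange_position,
            "drift": abs(calculated - exchange_position),
        }
    return web.json_response(report)

app = web.Application()
app.router.add_get("/debug/positions", debug_positions)
# Serve alongside your other internal endpoints, e.g. web.run_app(app, port=8081)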

Grafana Dashboard

{
  "title": "Trading SLOs",
  "panels": [
    {
      "title": "Fill Latency P99 by Exchange",
      "type": "graph",
      "targets": [
        {
          "expr": "trading:fill_latency:p99_5m"
        }
      ]
    },
    {
      "title": "Position Drift",
      "type": "gauge",
      "targets": [
        {
          "expr": "max(trading_position_drift_percent) * 100"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "max": 5,
          "thresholds": {
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 0.5, "color": "yellow"},
              {"value": 1, "color": "red"}
            ]
          }
        }
      }
    },
    {
      "title": "Market Data Staleness",
      "type": "stat",
      "targets": [
        {
          "expr": "max(trading_market_data_age_seconds)"
        }
      ]
    }
  ]
}

Design Philosophy {#design-philosophy}

Infrastructure vs Business Metrics

| Metric Type | Example | What It Tells You |
|---|---|---|
| Infrastructure | CPU = 50% | System is working |
| Application | Requests/sec = 1000 | System is handling load |
| Business | Fill latency P99 = 500ms | Money is at risk |

Prioritize business metrics. Infrastructure metrics are necessary but not sufficient.

The SLO Hierarchy

  1. Don’t lose money (position drift, loss limits)
  2. Don’t miss opportunities (fill latency, staleness)
  3. Don’t get blocked (rate limits, rejections)
  4. Stay operational (uptime, errors)

Most teams monitor #4 first. Start with #1.

Reading Path

Continue exploring with these related deep dives:

| Topic | Next Post |
|---|---|
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |