Infrastructure

Trading Metrics: What SRE Dashboards Miss

Fill latency, position drift, market data staleness. The SLOs that prevent losses, not just track uptime. Prometheus, Grafana, and alerting patterns.

7 min
#monitoring #observability #trading #sre #prometheus #grafana #infrastructure

Your trading system has 99.99% uptime. Congratulations. You’re measuring the wrong thing.

I’ve seen systems with perfect infrastructure dashboards lose $50K in a day. CPU was at 5%. Memory was fine. All services green. But fill latency had degraded from 50ms to 500ms, and nobody noticed until the PnL reconciliation.

This post covers the metrics that actually matter for trading, how to instrument them, and how to alert before money is lost.

The Problem {#the-problem}

Standard SRE metrics:

  • CPU usage
  • Memory utilization
  • Disk I/O
  • Network throughput
  • Service uptime

These are necessary but not sufficient. A server can have 5% CPU while:

  • Fills are taking 10x longer than expected
  • Positions have drifted from exchange state
  • Market data is 5 seconds stale
  • Rate limits are exhausted

The cost: By the time infrastructure metrics alert, you’ve already lost money.

For the infrastructure these metrics run on, see the related posts in the Reading Path at the end.

Fill Latency {#fill-latency}

What It Measures

Time from order submission to fill confirmation.

Why it matters: Fill latency directly impacts execution quality. A 500ms delay on a $1M order costs $500+ in slippage during volatile markets.
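
As a quick sanity check on that number, assuming a 5 bps adverse price move while the order is in flight (an illustrative figure, not a measured one):

# Rough slippage estimate for the delay described above
notional = 1_000_000               # order size in USD
adverse_move_bps = 5               # assumed adverse drift during the extra 500ms
slippage = notional * adverse_move_bps / 10_000
print(f"${slippage:,.0f}")         # $500 -- the figure quoted above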

Implementation

from prometheus_client import Histogram
import time

FILL_LATENCY = Histogram(
    'trading_fill_latency_seconds',
    'Time from order submission to fill confirmation',
    ['exchange', 'symbol'],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10]
)

class OrderManager:
    def __init__(self, exchange):
        self.exchange = exchange          # exchange client used to submit orders
        self.pending_orders = {}          # order_id -> order awaiting a fill

    def submit_order(self, order):
        order.submitted_at = time.time()  # stamp before handing off to the exchange
        self.pending_orders[order.id] = order
        self.exchange.submit(order)
    
    def on_fill(self, fill):
        order = self.pending_orders.pop(fill.order_id, None)
        if order:
            latency = time.time() - order.submitted_at
            FILL_LATENCY.labels(
                exchange=order.exchange,
                symbol=order.symbol
            ).observe(latency)

Prometheus Recording Rules

groups:
- name: trading_slos
  rules:
  - record: trading:fill_latency:p50_5m
    expr: histogram_quantile(0.50, sum(rate(trading_fill_latency_seconds_bucket[5m])) by (le, exchange))
  
  - record: trading:fill_latency:p95_5m
    expr: histogram_quantile(0.95, sum(rate(trading_fill_latency_seconds_bucket[5m])) by (le, exchange))
  
  - record: trading:fill_latency:p99_5m
    expr: histogram_quantile(0.99, sum(rate(trading_fill_latency_seconds_bucket[5m])) by (le, exchange))

Thresholds

| Environment | P50 | P95 | P99 |
|---|---|---|---|
| Cloud (acceptable) | <100ms | <500ms | <2s |
| Cloud (good) | <50ms | <200ms | <500ms |
| Colocated | <1ms | <5ms | <20ms |
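
To act on these numbers, one option is to alert on the p99 recording rule. A minimal sketch using the cloud "acceptable" ceiling from the table above (the alert name and thresholds are illustrative; tune them to your own targets):

- alert: FillLatencyDegraded
  expr: trading:fill_latency:p99_5m > 2        # cloud "acceptable" p99 ceiling
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Fill latency p99 above 2s on {{ $labels.exchange }}"
    description: "p99 fill latency is {{ $value }}s over the last 5 minutes"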

Position Drift {#position-drift}

What It Measures

Difference between your calculated position and exchange’s recorded position.

Why it matters: Drift means something upstream has gone wrong:

  • Missed fill messages
  • Network packet loss
  • Race conditions in reconciliation

Whatever the cause, your risk calculations are wrong until the drift is resolved.

Implementation

from prometheus_client import Gauge
import asyncio

POSITION_DRIFT = Gauge(
    'trading_position_drift_percent',
    'Difference between calculated and exchange position',
    ['exchange', 'symbol']
)

POSITION_DRIFT_ABS = Gauge(
    'trading_position_drift_absolute',
    'Absolute difference in position',
    ['exchange', 'symbol']
)

async def reconcile_positions():
    while True:
        for symbol in trading_symbols:
            calculated = database.get_position(symbol)
            exchange_position = await client.get_position(symbol)

            if calculated != 0:
                drift_pct = abs(calculated - exchange_position) / abs(calculated)
            else:
                # Any position when expecting 0 counts as 100% drift
                drift_pct = 1.0 if exchange_position != 0 else 0.0

            drift_abs = abs(calculated - exchange_position)

            POSITION_DRIFT.labels(
                exchange='binance',
                symbol=symbol
            ).set(drift_pct)

            POSITION_DRIFT_ABS.labels(
                exchange='binance',
                symbol=symbol
            ).set(drift_abs)

        await asyncio.sleep(60)  # Reconcile every minute
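
One way to run this is as a background task next to the trading loop. A sketch assuming an asyncio-based application (run_trading is a placeholder for your own entry point):

import asyncio

async def main():
    # Reconcile in the background for the lifetime of the app
    reconciler = asyncio.create_task(reconcile_positions())
    try:
        await run_trading()   # placeholder for your main trading loop
    finally:
        reconciler.cancel()   # stop reconciling on shutdown

asyncio.run(main())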

Alert Rules

- alert: PositionDrift
  expr: trading_position_drift_percent > 0.01  # 1%
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Position drift > 1% on {{ $labels.symbol }}"
    description: "Calculated and exchange positions differ by {{ $value | humanizePercentage }}"

Thresholds

  • Normal: drift = 0%
  • Warning: drift > 0.5%
  • Page immediately: drift > 1%

Market Data Staleness {#staleness}

What It Measures

Time since last orderbook update.

Why it matters: Trading on stale data = trading blind. Common causes:

  • WebSocket silently disconnected
  • Exchange rate limiting
  • Network congestion
  • Parser crash

Implementation

from prometheus_client import Gauge
import asyncio
import time

MARKET_DATA_AGE = Gauge(
    'trading_market_data_age_seconds',
    'Time since last orderbook update',
    ['exchange', 'symbol']
)

last_update = {}

def on_orderbook_update(exchange: str, symbol: str, data: dict):
    last_update[(exchange, symbol)] = time.time()
    # Process orderbook...

async def staleness_monitor():
    while True:
        now = time.time()
        for (exchange, symbol), ts in last_update.items():
            age = now - ts
            MARKET_DATA_AGE.labels(
                exchange=exchange,
                symbol=symbol
            ).set(age)
        await asyncio.sleep(1)

Alert Rules

- alert: StaleMarketData
  expr: trading_market_data_age_seconds > 5
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "No market data for 5s on {{ $labels.symbol }}"
    runbook: "Check WebSocket connection, reconnect if needed"

Order Rejection Rate {#rejections}

What It Measures

Percentage of orders rejected by exchange.

Why it matters: A high rejection rate usually indicates:

  • Insufficient margin
  • Invalid order sizes
  • Rate limits exceeded
  • Exchange maintenance
  • Bugs in order construction

Implementation

from prometheus_client import Counter

ORDERS_SUBMITTED = Counter(
    'trading_orders_submitted_total',
    'Total orders submitted',
    ['exchange', 'type']
)

ORDERS_REJECTED = Counter(
    'trading_orders_rejected_total',
    'Total orders rejected',
    ['exchange', 'reason']
)

ORDERS_FILLED = Counter(
    'trading_orders_filled_total',
    'Total orders filled',
    ['exchange']
)

def submit_order(order):
    ORDERS_SUBMITTED.labels(
        exchange=order.exchange,
        type=order.type
    ).inc()
    # Submit...

def on_rejection(order, reason):
    ORDERS_REJECTED.labels(
        exchange=order.exchange,
        reason=reason
    ).inc()

def on_fill(fill):
    ORDERS_FILLED.labels(exchange=fill.exchange).inc()

Recording Rules

- record: trading:order_rejection_rate:5m
  expr: |
    sum(rate(trading_orders_rejected_total[5m])) by (exchange)
    /
    sum(rate(trading_orders_submitted_total[5m])) by (exchange)

Thresholds

  • Normal: <1%
  • Warning: >3%
  • Critical: >5%
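
An alert keyed to the recorded rate and the critical threshold above might look like this (a sketch; the alert name is illustrative):

- alert: HighOrderRejectionRate
  expr: trading:order_rejection_rate:5m > 0.05   # critical threshold: >5%
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Order rejection rate above 5% on {{ $labels.exchange }}"
    description: "{{ $value | humanizePercentage }} of orders rejected over the last 5 minutes"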

Rate Limit Headroom {#rate-limits}

What It Measures

How close you are to exchange rate limits.

Why it matters: Hit rate limits = orders rejected = missed opportunities.

Implementation

from prometheus_client import Gauge
import time

RATE_LIMIT_USED = Gauge(
    'trading_rate_limit_used_percent',
    'Percentage of rate limit consumed',
    ['exchange', 'endpoint']
)

class RateLimitedClient:
    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self.requests = []
    
    def request(self, endpoint: str):
        now = time.time()
        self.requests = [t for t in self.requests if now - t < 60]
        
        used_pct = len(self.requests) / self.limit
        RATE_LIMIT_USED.labels(
            exchange='binance',
            endpoint=endpoint
        ).set(used_pct * 100)
        
        self.requests.append(now)
        # Make request...

Thresholds

  • Normal: <50%
  • Warning: >70%
  • Critical: >90%
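
Beyond alerting, the client itself can back off as it approaches the limit. A sketch built on the RateLimitedClient above; the 90% cutoff mirrors the critical threshold, and waiting for the oldest request to age out of the window is one policy among several:

import asyncio
import time

async def throttled_request(client: RateLimitedClient, endpoint: str):
    # Drop timestamps that have aged out of the rolling 60s window
    now = time.time()
    client.requests = [t for t in client.requests if now - t < 60]

    if len(client.requests) >= 0.9 * client.limit:
        # Over 90% of the budget: wait until the oldest request expires
        # rather than risk a rejection at the exchange
        await asyncio.sleep(60 - (now - client.requests[0]))

    return client.request(endpoint)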

Alerting Strategy {#alerting}

Alert Hierarchy

# AlertManager config
route:
  receiver: 'slack-trading'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-trading'
    continue: true
  - match:
      severity: warning
    receiver: 'slack-trading'

receivers:
- name: 'pagerduty-trading'
  pagerduty_configs:
  - service_key: '{{ PAGERDUTY_KEY }}'
    severity: critical

- name: 'slack-trading'
  slack_configs:
  - channel: '#trading-alerts'
    send_resolved: true

What Wakes You Up

PAGE IMMEDIATELY (2am wake-up):
├── Position drift > 1%
├── No fills in 5 minutes (market hours)
├── Market data stale > 30 seconds
├── Loss limit exceeded
└── WebSocket down all exchanges

SLACK (check in morning):
├── Fill latency P99 > 1s
├── Rejection rate > 3%
├── Rate limit > 70%
└── High reconnection rate
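
The "no fills in 5 minutes" page deserves its own rule, because it catches silent failures that every other metric misses. A sketch (market-hours gating is omitted here; add it via a recording rule or inhibition if you trade a session-based market):

- alert: NoFills
  expr: sum(rate(trading_orders_filled_total[5m])) == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No fills across all exchanges in the last 5 minutes"
    runbook: "Check order flow end-to-end: strategy, order manager, exchange connectivity"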

Runbook Pattern

annotations:
  summary: "Position drift detected"
  description: "{{ $labels.symbol }}: {{ $value | humanizePercentage }} drift"
  runbook: |
    1. Check /debug/positions endpoint for details
    2. Compare with exchange API positions
    3. If exchange is source of truth, force reconciliation
    4. If our system is wrong, investigate fill processing
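
The /debug/positions endpoint referenced in step 1 can be a thin read-only view over the same data the reconciler uses. A sketch assuming aiohttp and the database/client objects from the reconciliation example (both choices are illustrative):

from aiohttp import web

async def debug_positions(request: web.Request) -> web.Response:
    # Side-by-side view of calculated vs exchange positions for runbook step 1
    report = {}
    for symbol in trading_symbols:
        calculated = database.get_position(symbol)
        exchange_position = await client.get_position(symbol)
        report[symbol] = {
            "calculated": calculated,
            "exchange": exchange_position,
            "drift": abs(calculated - exchange_position),
        }
    return web.json_response(report)

app = web.Application()
app.router.add_get("/debug/positions", debug_positions)
# Serve alongside your other internal endpoints, e.g. web.run_app(app, port=8081)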

Grafana Dashboard

{
  "title": "Trading SLOs",
  "panels": [
    {
      "title": "Fill Latency P99 by Exchange",
      "type": "graph",
      "targets": [
        {
          "expr": "trading:fill_latency:p99_5m"
        }
      ]
    },
    {
      "title": "Position Drift",
      "type": "gauge",
      "targets": [
        {
          "expr": "max(trading_position_drift_percent) * 100"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "max": 5,
          "thresholds": {
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 0.5, "color": "yellow"},
              {"value": 1, "color": "red"}
            ]
          }
        }
      }
    },
    {
      "title": "Market Data Staleness",
      "type": "stat",
      "targets": [
        {
          "expr": "max(trading_market_data_age_seconds)"
        }
      ]
    }
  ]
}

Design Philosophy {#design-philosophy}

Infrastructure vs Business Metrics

| Metric Type | Example | What It Tells You |
|---|---|---|
| Infrastructure | CPU = 50% | System is working |
| Application | Requests/sec = 1000 | System is handling load |
| Business | Fill latency P99 = 500ms | Money is at risk |

Prioritize business metrics. Infrastructure metrics are necessary but not sufficient.

The SLO Hierarchy

  1. Don’t lose money (position drift, loss limits)
  2. Don’t miss opportunities (fill latency, staleness)
  3. Don’t get blocked (rate limits, rejections)
  4. Stay operational (uptime, errors)

Most teams monitor #4 first. Start with #1.

Reading Path

Continue exploring with these related deep dives:

| Topic | Next Post |
|---|---|
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |