Kubernetes StatefulSets: Why Trading Systems Need State
Deep dive into StatefulSets vs Deployments, pod identity, PersistentVolumes, and graceful shutdown patterns for trading infrastructure.
Deployments assume pods are fungible. Any instance can handle any request. Trading systems are the opposite.
Your trading engine holds state: exchange connections, position tracking, order IDs. Kill a pod, lose the state, lose money. Restart with a different identity, create duplicate orders.
This post covers why StatefulSets are essential for trading, how they work internally, and the complete configuration pattern.
The Problem {#the-problem}
Deployment failure mode:
```yaml
# WRONG
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trading-bot
spec:
  replicas: 3
```
What happens:
- All 3 pods connect to Binance
- All 3 receive same market data
- All 3 try to execute same trade
- 2/3 rejected as duplicates
- Rate limits exhausted
Root cause: Pods have random names (trading-bot-7d8f9-xyz). No identity assignment. No leader election. No state persistence.
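A minimal sketch of the collision, assuming each replica runs the same strategy on the same feed and derives a deterministic client order ID (all names here are illustrative):

```python
# Illustrative only: three identical replicas, same tick, same decision.
import hashlib

def decide_order(tick: dict) -> dict:
    # Every replica runs the same strategy on the same market data...
    return {"symbol": "BTCUSDT", "side": "BUY", "qty": 0.1, "signal_ts": tick["ts"]}

def client_order_id(order: dict) -> str:
    # ...so a deterministic client order ID collides across replicas.
    raw = f"{order['symbol']}:{order['side']}:{order['signal_ts']}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

tick = {"ts": 1_700_000_000_123}
ids = {client_order_id(decide_order(tick)) for _ in range(3)}  # three "pods"
print(len(ids))  # 1 -> one accepted order, two duplicate rejections, three rate-limit hits
```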
For the broader architecture context, see First Principles. For kernel-level tuning on Kubernetes nodes, see CPU Optimization.
Background: Kubernetes Scheduling {#background}
How Deployments Work
Deployments manage ReplicaSets (controller source):
Deployment → ReplicaSet → Pods
Key behaviors:
- Pods get random suffixes
- Any pod can be killed first during scale-down
- PersistentVolumeClaims are shared (if any)
- No ordering guarantees
How StatefulSets Work
StatefulSets provide ordered, persistent identity (controller source):
StatefulSet → Pods with stable names
↓
pod-0, pod-1, pod-2 (always)
Key behaviors:
- Pods get ordinal names: {statefulset}-0, {statefulset}-1
- Ordered creation: pod 0 must be Running before pod 1 starts
- Ordered deletion: N-1 deleted before N-2
- Stable network identity via headless service
- Per-pod PersistentVolumeClaims
Why This Matters for Trading
| Requirement | Deployment | StatefulSet |
|---|---|---|
| Stable identity | No | Yes |
| Per-pod storage | Shared only | Per-pod |
| Ordered scaling | No | Yes |
| Network identity | Random | Stable DNS |
Fix 1: StatefulSets for Identity {#statefulsets}
The Pattern
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
  namespace: trading
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel  # All start together (fast); overrides default OrderedReady
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine
    spec:
      containers:
        - name: engine
          image: trading-engine:latest
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            # No ASSIGNED_MARKET value is injected. The application reads
            # POD_NAME and derives its assignment:
            #   trading-engine-0 → BTC
            #   trading-engine-1 → ETH
            #   trading-engine-2 → SOL
```
Application Logic
```python
import os

POD_NAME = os.environ.get('POD_NAME', 'trading-engine-0')
POD_ORDINAL = int(POD_NAME.split('-')[-1])

MARKET_ASSIGNMENTS = {
    0: ['BTCUSDT', 'BTCUSD'],
    1: ['ETHUSDT', 'ETHUSD'],
    2: ['SOLUSDT', 'SOLUSD'],
}

my_markets = MARKET_ASSIGNMENTS.get(POD_ORDINAL, [])
print(f"Pod {POD_ORDINAL} handling markets: {my_markets}")
```
Expected Behavior
| Event | Result |
|---|---|
| Pod-0 crashes | Pod-0 restarts (same identity, same markets) |
| Scale to 4 | Pod-3 created, gets new market assignment |
| Scale to 2 | Pod-2 deleted first (reverse order) |
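One gap in the snippet above: the hardcoded MARKET_ASSIGNMENTS dict means pod-3 would start with no markets until someone ships new code. A hedged sketch of a config-driven variant, assuming the assignments are mounted as a JSON file (for example from a ConfigMap); the path and format are illustrative:

```python
import json
import os
from pathlib import Path

# Hypothetical mount point for a ConfigMap holding {"0": [...], "1": [...], ...}
ASSIGNMENTS_FILE = Path("/etc/trading/market-assignments.json")

def my_markets() -> list:
    ordinal = os.environ.get("POD_NAME", "trading-engine-0").rsplit("-", 1)[-1]
    assignments = json.loads(ASSIGNMENTS_FILE.read_text())
    return assignments.get(ordinal, [])  # a new ordinal idles until it is assigned
```

Scaling to 4 then becomes a config change plus a replicas bump, with no image rebuild.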
Fix 2: Headless Services {#headless}
The Problem
ClusterIP services load-balance. You can’t connect to a specific pod.
The Fix
```yaml
apiVersion: v1
kind: Service
metadata:
  name: trading-headless
  namespace: trading
spec:
  clusterIP: None  # Headless
  selector:
    app: trading-engine
  ports:
    - port: 8080
      name: http
    - port: 9090
      name: metrics
```
How It Works
With a headless service, each pod gets a stable DNS name:
- trading-engine-0.trading-headless.trading.svc.cluster.local
- trading-engine-1.trading-headless.trading.svc.cluster.local
Your risk engine can connect directly to each trading engine:
```python
TRADING_ENGINES = [
    "trading-engine-0.trading-headless.trading.svc.cluster.local:8080",
    "trading-engine-1.trading-headless.trading.svc.cluster.local:8080",
]

for engine in TRADING_ENGINES:
    # get_position() is the risk engine's own client call; a sketch follows below
    position = get_position(engine)
```
No load balancer in the path. Direct TCP connections.
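A sketch of get_position, assuming each engine exposes its position over the HTTP port; the /position path and JSON response shape are illustrative, not a documented API:

```python
import requests

def get_position(engine: str, timeout: float = 1.0) -> dict:
    # engine is "pod-dns-name:port"; this is a direct pod-to-pod call,
    # no Service VIP or load balancer in between.
    resp = requests.get(f"http://{engine}/position", timeout=timeout)
    resp.raise_for_status()
    return resp.json()
```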
Fix 3: Persistent Volumes {#pv}
The Problem
Trading engines need persistent state:
- Order history (for reconciliation)
- Position snapshots (for crash recovery)
- WAL logs (for replay)
Without persistence, restart = lost state.
The Fix
```yaml
volumeClaimTemplates:
  - metadata:
      name: trading-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "gp3-encrypted"
      resources:
        requests:
          storage: 50Gi
```
StorageClass for EKS:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
```
How It Works
Each pod gets its own PVC:
- trading-data-trading-engine-0
- trading-data-trading-engine-1
PVCs persist across pod restarts. Delete pod → PVC remains → New pod gets same PVC.
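A hedged sketch of how the engine might use that volume for crash recovery, assuming the PVC is mounted at /data (as in the complete example below); the snapshot file name and JSON format are illustrative:

```python
import json
from pathlib import Path

SNAPSHOT = Path("/data/position_snapshot.json")  # lives on the per-pod PVC

def save_positions(positions: dict) -> None:
    tmp = SNAPSHOT.with_suffix(".tmp")
    tmp.write_text(json.dumps(positions))
    tmp.replace(SNAPSHOT)  # atomic rename: a crash mid-write never corrupts the snapshot

def load_positions() -> dict:
    if SNAPSHOT.exists():  # a restarted pod reattaches the same PVC and finds its snapshot
        return json.loads(SNAPSHOT.read_text())
    return {}  # first start on a fresh volume
```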
For EBS optimization, see Storage Deep Dive.
Fix 4: Graceful Shutdown {#shutdown}
The Problem
Default termination: SIGTERM → wait 30s → SIGKILL.
Trading needs:
- Cancel open orders (5-30s)
- Wait for exchange confirmations (10s)
- Flush state (1s)
30 seconds isn’t enough if the exchange is slow.
The Fix
```yaml
spec:
  terminationGracePeriodSeconds: 120  # 2 minutes
  containers:
    - name: engine
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - |
                # Signal application to stop trading
                curl -X POST http://localhost:8080/shutdown
                # Wait for order cancellations
                sleep 60
                # Final state flush happens in SIGTERM handler
```
Application Pattern
```python
import signal
import sys
import time

shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True  # main loop stops placing new orders

    # Cancel all open orders (get_open_orders/cancel_order are the
    # engine's own order-management functions)
    for order in get_open_orders():
        cancel_order(order)

    # Wait for exchange confirmations
    while get_open_orders():
        time.sleep(1)

    # Flush state to the persistent volume
    save_state_to_disk()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```
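The preStop hook POSTs to /shutdown before SIGTERM arrives. A minimal sketch of that endpoint, assuming the engine embeds a small HTTP control server on its main port; it only flips the flag so the main loop stops placing new orders, and leaves cancellation and the state flush to the SIGTERM handler above:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class ControlHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        global shutdown_requested
        if self.path == "/shutdown":
            shutdown_requested = True  # stop placing new orders; SIGTERM handles the rest
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

def start_control_server(port: int = 8080) -> None:
    server = HTTPServer(("0.0.0.0", port), ControlHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
```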
Fix 5: Pod Disruption Budgets {#pdb}
The Problem
Kubernetes can evict pods during:
- Node upgrades
- Cluster autoscaler decisions
- Spot instance reclaims
Without protection, all pods could be evicted simultaneously.
The Fix
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trading-pdb
spec:
  minAvailable: 2  # At least 2 pods always running
  selector:
    matchLabels:
      app: trading-engine
```
How It Works
Voluntary disruptions (upgrades, autoscaler) respect PDB:
- Want to evict pod-0
- Check PDB: 3 running, need 2 minimum
- Eviction allowed (3-1=2 ≥ 2)
Involuntary disruptions (node crash) don’t check PDB. You need multi-AZ for that; the zone-level podAntiAffinity in the complete example below covers it.
Complete StatefulSet Example
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
  namespace: trading
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      terminationGracePeriodSeconds: 120
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values:
                      - trading
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: trading-engine
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: engine
          image: trading-engine:v1.2.3
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics
          volumeMounts:
            - name: trading-data
              mountPath: /data
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "curl -X POST localhost:8080/shutdown && sleep 60"
  volumeClaimTemplates:
    - metadata:
        name: trading-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "gp3-encrypted"
        resources:
          requests:
            storage: 50Gi
```
Design Philosophy {#design-philosophy}
Stateless vs Stateful
Kubernetes was designed for stateless workloads. The original patterns assumed:
- Ephemeral pods
- Shared state in databases
- Any pod handles any request
Trading is inherently stateful:
- Exchange connections are stateful (WebSocket)
- Position tracking requires memory
- Order IDs need persistence
StatefulSets bridge this gap.
The Tradeoff
| Deployment | StatefulSet |
|---|---|
| Simple scaling | Ordered scaling |
| Fast rollouts | Careful rollouts |
| No identity | Stable identity |
| Shared state | Per-pod state |
StatefulSets are more complex. That complexity is the cost of correctness.
Audit Your Infrastructure
Running trading on Kubernetes? The underlying nodes still need kernel tuning. Run latency-audit on your node pools to check CPU governors, memory settings, and network configurations.
```bash
pip install latency-audit && latency-audit
```

Up Next in Linux Infrastructure Deep Dives
Trading Metrics: What SRE Dashboards Miss
Fill latency, position drift, market data staleness. The SLOs that prevent losses, not just track uptime. Prometheus, Grafana, and alerting patterns.
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |