Kubernetes StatefulSets: Why Trading Systems Need State
Deep dive into StatefulSets vs Deployments, pod identity, PersistentVolumes, and graceful shutdown patterns for trading infrastructure.
Deployments assume pods are fungible. Any instance can handle any request. Trading systems are the opposite.
Your trading engine holds state: exchange connections, position tracking, order IDs. Kill a pod, lose the state, lose money. Restart with a different identity, create duplicate orders.
This post covers why StatefulSets are essential for trading, how they work internally, and the complete configuration pattern.
The Problem {#the-problem}
Deployment failure mode:
```yaml
# WRONG
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trading-bot
spec:
  replicas: 3
```
What happens:
- All 3 pods connect to Binance
- All 3 receive same market data
- All 3 try to execute same trade
- 2/3 rejected as duplicates
- Rate limits exhausted
Root cause: Pods have random names (trading-bot-7d8f9-xyz). No identity assignment. No leader election. No state persistence.
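A minimal sketch of the collision, assuming each replica runs the same strategy on the same feed and derives a deterministic client order ID (all names here are illustrative):

```python
# Illustrative only: three identical replicas, same tick, same decision.
import hashlib

def decide_order(tick: dict) -> dict:
    # Every replica runs the same strategy on the same market data...
    return {"symbol": "BTCUSDT", "side": "BUY", "qty": 0.1, "signal_ts": tick["ts"]}

def client_order_id(order: dict) -> str:
    # ...so a deterministic client order ID collides across replicas.
    raw = f"{order['symbol']}:{order['side']}:{order['signal_ts']}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

tick = {"ts": 1_700_000_000_123}
ids = {client_order_id(decide_order(tick)) for _ in range(3)}  # three "pods"
print(len(ids))  # 1 -> one accepted order, two duplicate rejections, three rate-limit hits
```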
For the broader architecture context, see First Principles. For kernel-level tuning on Kubernetes nodes, see CPU Optimization.
Background: Kubernetes Scheduling {#background}
How Deployments Work
Deployments manage ReplicaSets (controller source):
Deployment → ReplicaSet → Pods
Key behaviors:
- Pods get random suffixes
- Any pod can be killed first during scale-down
- PersistentVolumeClaims are shared (if any)
- No ordering guarantees
How StatefulSets Work
StatefulSets provide ordered, persistent identity (controller source):
StatefulSet → Pods with stable names
↓
pod-0, pod-1, pod-2 (always)
Key behaviors:
- Pods get ordinal names: {statefulset}-0, {statefulset}-1
- Ordered creation: pod 0 must be Running before pod 1 starts
- Ordered deletion: N-1 deleted before N-2
- Stable network identity via headless service
- Per-pod PersistentVolumeClaims
Why This Matters for Trading
| Requirement | Deployment | StatefulSet |
|---|---|---|
| Stable identity | No | Yes |
| Per-pod storage | Shared only | Per-pod |
| Ordered scaling | No | Yes |
| Network identity | Random | Stable DNS |
Fix 1: StatefulSets for Identity {#statefulsets}
The Pattern
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
  namespace: trading
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel  # All start together (fast); overrides default OrderedReady
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine
    spec:
      containers:
        - name: engine
          image: trading-engine:latest
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            # No ASSIGNED_MARKET value is injected. The application reads
            # POD_NAME and derives its assignment:
            #   trading-engine-0 → BTC
            #   trading-engine-1 → ETH
            #   trading-engine-2 → SOL
```
Application Logic
```python
import os

POD_NAME = os.environ.get('POD_NAME', 'trading-engine-0')
POD_ORDINAL = int(POD_NAME.split('-')[-1])

MARKET_ASSIGNMENTS = {
    0: ['BTCUSDT', 'BTCUSD'],
    1: ['ETHUSDT', 'ETHUSD'],
    2: ['SOLUSDT', 'SOLUSD'],
}

my_markets = MARKET_ASSIGNMENTS.get(POD_ORDINAL, [])
print(f"Pod {POD_ORDINAL} handling markets: {my_markets}")
```
Expected Behavior
| Event | Result |
|---|---|
| Pod-0 crashes | Pod-0 restarts (same identity, same markets) |
| Scale to 4 | Pod-3 created, gets new market assignment |
| Scale to 2 | Pod-2 deleted first (reverse order) |
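One gap in the snippet above: the hardcoded MARKET_ASSIGNMENTS dict means pod-3 would start with no markets until someone ships new code. A hedged sketch of a config-driven variant, assuming the assignments are mounted as a JSON file (for example from a ConfigMap); the path and format are illustrative:

```python
import json
import os
from pathlib import Path

# Hypothetical mount point for a ConfigMap holding {"0": [...], "1": [...], ...}
ASSIGNMENTS_FILE = Path("/etc/trading/market-assignments.json")

def my_markets() -> list:
    ordinal = os.environ.get("POD_NAME", "trading-engine-0").rsplit("-", 1)[-1]
    assignments = json.loads(ASSIGNMENTS_FILE.read_text())
    return assignments.get(ordinal, [])  # a new ordinal idles until it is assigned
```

Scaling to 4 then becomes a config change plus a replicas bump, with no image rebuild.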
Fix 2: Headless Services {#headless}
The Problem
ClusterIP services load-balance. You can’t connect to a specific pod.
The Fix
```yaml
apiVersion: v1
kind: Service
metadata:
  name: trading-headless
  namespace: trading
spec:
  clusterIP: None  # Headless
  selector:
    app: trading-engine
  ports:
    - port: 8080
      name: http
    - port: 9090
      name: metrics
```
How It Works
With a headless service, each pod gets a stable DNS name:
- trading-engine-0.trading-headless.trading.svc.cluster.local
- trading-engine-1.trading-headless.trading.svc.cluster.local
Your risk engine can connect directly to each trading engine:
```python
TRADING_ENGINES = [
    "trading-engine-0.trading-headless.trading.svc.cluster.local:8080",
    "trading-engine-1.trading-headless.trading.svc.cluster.local:8080",
]

for engine in TRADING_ENGINES:
    # get_position() is the risk engine's own client call; a sketch follows below
    position = get_position(engine)
```
No load balancer in the path. Direct TCP connections.
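A sketch of get_position, assuming each engine exposes its position over the HTTP port; the /position path and JSON response shape are illustrative, not a documented API:

```python
import requests

def get_position(engine: str, timeout: float = 1.0) -> dict:
    # engine is "pod-dns-name:port"; this is a direct pod-to-pod call,
    # no Service VIP or load balancer in between.
    resp = requests.get(f"http://{engine}/position", timeout=timeout)
    resp.raise_for_status()
    return resp.json()
```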
Fix 3: Persistent Volumes {#pv}
The Problem
Trading engines need persistent state:
- Order history (for reconciliation)
- Position snapshots (for crash recovery)
- WAL logs (for replay)
Without persistence, restart = lost state.
The Fix
```yaml
volumeClaimTemplates:
  - metadata:
      name: trading-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "gp3-encrypted"
      resources:
        requests:
          storage: 50Gi
```
StorageClass for EKS:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
```
How It Works
Each pod gets its own PVC:
- trading-data-trading-engine-0
- trading-data-trading-engine-1
PVCs persist across pod restarts. Delete pod → PVC remains → New pod gets same PVC.
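A hedged sketch of how the engine might use that volume for crash recovery, assuming the PVC is mounted at /data (as in the complete example below); the snapshot file name and JSON format are illustrative:

```python
import json
from pathlib import Path

SNAPSHOT = Path("/data/position_snapshot.json")  # lives on the per-pod PVC

def save_positions(positions: dict) -> None:
    tmp = SNAPSHOT.with_suffix(".tmp")
    tmp.write_text(json.dumps(positions))
    tmp.replace(SNAPSHOT)  # atomic rename: a crash mid-write never corrupts the snapshot

def load_positions() -> dict:
    if SNAPSHOT.exists():  # a restarted pod reattaches the same PVC and finds its snapshot
        return json.loads(SNAPSHOT.read_text())
    return {}  # first start on a fresh volume
```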
For EBS optimization, see Storage Deep Dive.
Fix 4: Graceful Shutdown {#shutdown}
The Problem
Default termination: SIGTERM → wait 30s → SIGKILL.
Trading needs:
- Cancel open orders (5-30s)
- Wait for exchange confirmations (10s)
- Flush state (1s)
30 seconds isn’t enough if the exchange is slow.
The Fix
```yaml
spec:
  terminationGracePeriodSeconds: 120  # 2 minutes
  containers:
    - name: engine
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - |
                # Signal application to stop trading
                curl -X POST http://localhost:8080/shutdown
                # Wait for order cancellations
                sleep 60
                # Final state flush happens in SIGTERM handler
```
Application Pattern
```python
import signal
import sys
import time

shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True  # main loop stops placing new orders

    # Cancel all open orders (get_open_orders/cancel_order are the
    # engine's own order-management functions)
    for order in get_open_orders():
        cancel_order(order)

    # Wait for exchange confirmations
    while get_open_orders():
        time.sleep(1)

    # Flush state to the persistent volume
    save_state_to_disk()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
```
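The preStop hook POSTs to /shutdown before SIGTERM arrives. A minimal sketch of that endpoint, assuming the engine embeds a small HTTP control server on its main port; it only flips the flag so the main loop stops placing new orders, and leaves cancellation and the state flush to the SIGTERM handler above:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class ControlHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        global shutdown_requested
        if self.path == "/shutdown":
            shutdown_requested = True  # stop placing new orders; SIGTERM handles the rest
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

def start_control_server(port: int = 8080) -> None:
    server = HTTPServer(("0.0.0.0", port), ControlHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
```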
Fix 5: Pod Disruption Budgets {#pdb}
The Problem
Kubernetes can evict pods during:
- Node upgrades
- Cluster autoscaler decisions
- Spot instance reclaims
Without protection, all pods could be evicted simultaneously.
The Fix
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trading-pdb
spec:
  minAvailable: 2  # At least 2 pods always running
  selector:
    matchLabels:
      app: trading-engine
```
How It Works
Voluntary disruptions (upgrades, autoscaler) respect PDB:
- Want to evict pod-0
- Check PDB: 3 running, need 2 minimum
- Eviction allowed (3-1=2 ≥ 2)
Involuntary disruptions (node crash) don’t check PDB. You need multi-AZ for that; the zone-level podAntiAffinity in the complete example below covers it.
Complete StatefulSet Example
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
  namespace: trading
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      terminationGracePeriodSeconds: 120
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values:
                      - trading
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: trading-engine
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: engine
          image: trading-engine:v1.2.3
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
            limits:
              memory: "8Gi"
              cpu: "4000m"
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 9090
              name: metrics
          volumeMounts:
            - name: trading-data
              mountPath: /data
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "curl -X POST localhost:8080/shutdown && sleep 60"
  volumeClaimTemplates:
    - metadata:
        name: trading-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "gp3-encrypted"
        resources:
          requests:
            storage: 50Gi
```
Design Philosophy {#design-philosophy}
Stateless vs Stateful
Kubernetes was designed for stateless workloads. The original patterns assumed:
- Ephemeral pods
- Shared state in databases
- Any pod handles any request
Trading is inherently stateful:
- Exchange connections are stateful (WebSocket)
- Position tracking requires memory
- Order IDs need persistence
StatefulSets bridge this gap.
The Tradeoff
| Deployment | StatefulSet |
|---|---|
| Simple scaling | Ordered scaling |
| Fast rollouts | Careful rollouts |
| No identity | Stable identity |
| Shared state | Per-pod state |
StatefulSets are more complex. That complexity is the cost of correctness.
Audit Your Infrastructure
Running trading on Kubernetes? The underlying nodes still need kernel tuning. Run latency-audit on your node pools to check CPU governors, memory settings, and network configurations.
```bash
pip install latency-audit && latency-audit
```

Up Next in Linux Infrastructure Deep Dives
Trading Metrics: What SRE Dashboards Miss
Fill latency, position drift, market data staleness. The SLOs that prevent losses, not just track uptime. Prometheus, Grafana, and alerting patterns.
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |