High availability
An AGLedger deployment has three tiers with three different availability stories:
| Tier | State | HA approach |
|---|---|---|
| API | Stateless | Scale replicas; one env footgun (RATE_LIMIT_STORE) |
| Worker | Stateless (queue in Postgres) | Scale replicas; no special handling |
| PostgreSQL | All of it | Managed failover (Aurora, RDS Multi-AZ, Patroni) |
The chain lives entirely in PostgreSQL. Application pods hold no durable state — every tamper-evidence guarantee survives any pod being killed at any moment, because chain appends are single transactions serialized per record by a database advisory lock. There is no HA mode to turn on in the application; there is a deployment shape to configure.
API tier
API pods are stateless and scale horizontally. Three changes from the single-node default, all in the chart's values:
api:
  replicaCount: 2 # or enable the HPA
  strategy:
    type: RollingUpdate # default is Recreate, sized for single-node clusters
  extraEnv: # required when running more than one API replica
    - name: RATE_LIMIT_STORE
      value: postgresql
The default rate-limit store is in-process memory, which is correct for one replica and silently wrong for N — each replica enforces its own counters, multiplying every effective limit by N. The PostgreSQL store gives cluster-wide limits. (If the store cannot reach a database pool it falls back to in-memory rather than failing requests, so set it and confirm rather than assume.)
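To make the multiplication concrete, here is a minimal Python simulation. It is not AGLedger code: InMemoryLimiter, accepted_requests, the limit of 100, and the round-robin traffic pattern are all hypothetical choices for illustration. Three replicas that each hold their own counter accept three times the intended limit, while a single shared counter (standing in for the PostgreSQL store) enforces it once.

```python
from __future__ import annotations


class InMemoryLimiter:
    """Fixed-budget counter held in one process (hypothetical model)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts: dict[str, int] = {}

    def allow(self, client: str) -> bool:
        used = self.counts.get(client, 0)
        if used >= self.limit:
            return False
        self.counts[client] = used + 1
        return True


def accepted_requests(limiters: list[InMemoryLimiter], total: int,
                      client: str = "tenant-a") -> int:
    """Send `total` requests round-robin across replicas; count how many pass."""
    accepted = 0
    for i in range(total):
        if limiters[i % len(limiters)].allow(client):
            accepted += 1
    return accepted


LIMIT, REPLICAS, REQUESTS = 100, 3, 1000

# One independent limiter per replica: the in-memory default on three pods.
per_replica = [InMemoryLimiter(LIMIT) for _ in range(REPLICAS)]

# The same limiter object seen by every replica: a stand-in for a shared store.
shared = [InMemoryLimiter(LIMIT)] * REPLICAS

print(accepted_requests(per_replica, REQUESTS))  # 300
print(accepted_requests(shared, REQUESTS))       # 100
```

The configured limit is identical in both runs; only the location of the counter changes.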
Everything else is already multi-replica-safe: signing-key and discovery caches invalidate across
replicas via Postgres LISTEN/NOTIFY, and migrations cannot race. In the chart, migrations run as a
pre-install/pre-upgrade Job — one runner per release, not one per pod. The optional startup runner
(RUN_MIGRATIONS, off by default) takes a global advisory lock and verifies migration checksums;
a pod that loses the lock exits and restarts rather than running migrations twice.
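The lock-then-verify behaviour of the startup runner can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Server's implementation: a threading.Lock stands in for the database advisory lock, SHA-256 is an assumed checksum algorithm, and run_migrations, checksum, and the file names are hypothetical.

```python
import hashlib
import threading


def checksum(sql: str) -> str:
    """Digest of a migration's contents (SHA-256 is an assumption)."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


def run_migrations(lock: threading.Lock, files: dict, recorded: dict) -> str:
    """Apply pending migrations if this runner wins the lock.

    files:    migration name -> SQL text shipped with this release
    recorded: migration name -> checksum stored when it was applied
    """
    # Non-blocking acquire: a runner that loses the lock does not wait.
    if not lock.acquire(blocking=False):
        return "lost-lock"  # a real pod would exit here and be restarted
    try:
        # Refuse to proceed if an already-applied migration was edited or removed.
        for name, digest in recorded.items():
            if name not in files or checksum(files[name]) != digest:
                return "checksum-mismatch"
        # "Apply" anything new, in name order, recording its checksum.
        for name in sorted(files):
            if name not in recorded:
                recorded[name] = checksum(files[name])
        return "ok"
    finally:
        lock.release()


lock = threading.Lock()
files = {"001_init.sql": "CREATE TABLE t (id int);"}
recorded = {}
print(run_migrations(lock, files, recorded))  # ok
```

Running the function a second time with the same inputs applies nothing, which is the property that makes a restart after a lost lock harmless.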
The chart ships an HPA, a PodDisruptionBudget, and a commented pod-anti-affinity example (spread
replicas across nodes) for the API deployment — see values.yaml.
Worker tier
Worker replicas scale the same way (worker.replicaCount), with no special configuration:
- Queue jobs (gate evaluation, webhook delivery, federation outbound) are claimed from Postgres with SKIP LOCKED semantics, so each job is processed by exactly one replica.
- Scheduled maintenance (expiry sweeps; vault checkpoints, which include anchor uploads; and recovery sweeps) is cluster-singleton: pg-boss claims each cron tick atomically in the database, so one replica fires it regardless of replica count. We tested this directly: three concurrent workers, one schedule, exactly one job per tick.
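The claim pattern behind the first point can be modelled without a database. In PostgreSQL the mechanism is a SELECT ... FOR UPDATE SKIP LOCKED query; the Python sketch below imitates it with in-process locks so that it runs anywhere. It is not pg-boss's implementation, and Job, claim_next, worker, and run are hypothetical names. Each worker takes the first pending job whose row lock is free and skips locked rows instead of waiting on them.

```python
import threading


class Job:
    def __init__(self, job_id: int) -> None:
        self.id = job_id
        self.state = "pending"
        self.row_lock = threading.Lock()  # stand-in for a row-level lock


def claim_next(jobs):
    """Return the first pending job whose row lock is free, with the lock held."""
    for job in jobs:
        if not job.row_lock.acquire(blocking=False):
            continue  # locked by another worker: skip it instead of waiting
        if job.state == "pending":
            return job
        job.row_lock.release()  # already done, so not claimable
    return None


def worker(name, jobs, processed):
    while (job := claim_next(jobs)) is not None:
        processed.append((name, job.id))  # "do the work"
        job.state = "done"
        job.row_lock.release()  # "commit", which releases the row lock


def run(num_workers: int = 3, num_jobs: int = 50):
    jobs = [Job(i) for i in range(num_jobs)]
    processed = []
    threads = [
        threading.Thread(target=worker, args=(f"worker-{n}", jobs, processed))
        for n in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed


processed = run()
ids = sorted(job_id for _, job_id in processed)
print(ids == list(range(50)))  # True: every job ran exactly once
```

Which worker handles which job varies from run to run; the invariant is that no job is handled twice and none is missed.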
A single worker replica is fully functional and is the right default; add replicas for queue throughput or for zero-gap coverage across pod restarts. The chart ships a worker HPA and PDB.
Database tier
PostgreSQL is the single stateful component, so the availability of your deployment is the availability of your database. Use a managed failover-capable Postgres: Aurora, RDS Multi-AZ, or an equivalent (Patroni/CloudNativePG for self-managed clusters; any failover-capable PostgreSQL 17+ works — nothing here is AWS-specific).
During a failover, in-flight writes fail and clients receive errors until the new primary is up; the application's connection pool recovers on its own as connections are checked out. Chain integrity is unaffected by design — an append either committed as one transaction or it did not. There is no partial chain state to repair, and verification after a failover requires nothing beyond the normal audit runbook.
The bundled Postgres (the Compose overlay docker-compose.postgres.yml, or the chart's
postgres.bundled deployment) is a single-node convenience, explicitly not an HA database.
Running on an external database is the Enterprise license boundary: the Developer Edition license
covers production use on the bundled single-node Postgres, and pointing the Server at Aurora or
any other external Postgres is Enterprise scope — which is why high availability is an
Enterprise-tier line item. The boundary is contractual, not technical: the Server detects the
external-database case and logs a persistent licensing notice, but never blocks or degrades.
External anchoring is unaffected by all of this: anchors are written to object storage outside the database and remain valid ground truth across any failover.
The supported HA shape, summarized
- API: replicaCount: 2+, RollingUpdate, RATE_LIMIT_STORE=postgresql, anti-affinity across nodes.
- Worker: replicaCount: 1 default, 2+ for throughput; safe either way.
- PostgreSQL: managed failover (Enterprise license), TLS required in production.
- Anchoring enabled, to a separately-administered bucket.
What is not supported today: multi-region active-active against one logical chain, and any deployment shape where two Servers write the same database schema independently. One Server, one database; scale the stateless tiers; let the database layer own failover. For multi-site topologies, run sovereign Servers per site and connect them with federation.
Validated against API v1.0.0 on 2026-06-10. Worker-scaling claim run-tested: three concurrent worker instances against one database produced exactly one scheduled job per cron tick.