High availability

An AGLedger deployment has three tiers with three different availability stories:

| Tier | State | HA approach |
|---|---|---|
| API | Stateless | Scale replicas; one env footgun (RATE_LIMIT_STORE) |
| Worker | Stateless (queue in Postgres) | Scale replicas; no special handling |
| PostgreSQL | All of it | Managed failover (Aurora, RDS Multi-AZ, Patroni) |

The chain lives entirely in PostgreSQL. Application pods hold no durable state - every tamper-evidence guarantee survives any pod being killed at any moment, because chain appends are single transactions serialized per record by a database advisory lock. There is no HA mode to turn on in the application; there is a deployment shape to configure.

API tier

API pods are stateless and scale horizontally. Three changes from the single-node default, all in the chart's values:

api:
  replicaCount: 2          # or enable the HPA
  strategy:
    type: RollingUpdate    # default is Recreate, sized for single-node clusters

extraEnv:                  # required when running more than one API replica
  - name: RATE_LIMIT_STORE
    value: postgresql

The default rate-limit store is in-process memory, which is correct for one replica and silently wrong for N - each replica enforces its own counters, multiplying every effective limit by N. The PostgreSQL store gives cluster-wide limits. (If the store cannot reach a database pool it falls back to in-memory rather than failing requests, so set it and confirm rather than assume.)
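The multiplication effect is easy to reproduce in a toy model. The sketch below is not AGLedger code: `FixedWindowLimiter`, the limit of 100, and the round-robin load balancing are all assumptions made for illustration. It compares three replicas that each keep private counters against three replicas that share one counter store.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Toy fixed-window limiter (hypothetical); `store` maps client -> count."""

    def __init__(self, limit, store):
        self.limit = limit
        self.store = store

    def allow(self, client):
        if self.store[client] >= self.limit:
            return False
        self.store[client] += 1
        return True


def allowed_requests(limiters, attempts, client="client-a"):
    """Round-robin `attempts` requests across replicas; count accepted ones."""
    return sum(
        limiters[i % len(limiters)].allow(client) for i in range(attempts)
    )


LIMIT, REPLICAS, ATTEMPTS = 100, 3, 1000

# In-memory store: every replica keeps its own private counters.
per_replica = [FixedWindowLimiter(LIMIT, defaultdict(int)) for _ in range(REPLICAS)]
in_memory_allowed = allowed_requests(per_replica, ATTEMPTS)

# Shared store: all replicas read and write the same counters,
# standing in for RATE_LIMIT_STORE=postgresql.
shared = defaultdict(int)
cluster_wide = [FixedWindowLimiter(LIMIT, shared) for _ in range(REPLICAS)]
shared_allowed = allowed_requests(cluster_wide, ATTEMPTS)

print(in_memory_allowed, shared_allowed)  # 300 100
```

With private counters the client gets 3 x 100 requests through; with a shared store the configured limit holds regardless of replica count.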

Everything else is already multi-replica-safe: signing-key and discovery caches invalidate across replicas via Postgres LISTEN/NOTIFY, and migrations cannot race. In the chart, migrations run as a pre-install/pre-upgrade Job - one runner per release, not one per pod. The optional startup runner (RUN_MIGRATIONS, off by default) takes a global advisory lock and verifies migration checksums; a pod that loses the lock exits and restarts rather than running migrations twice.

The chart ships an HPA, a PodDisruptionBudget, and a commented pod-anti-affinity example (spread replicas across nodes) for the API deployment - see values.yaml.

Worker tier

Worker replicas scale the same way (worker.replicaCount), with no special configuration:

- Queue jobs (gate evaluation, webhook delivery, federation outbound) are claimed from Postgres with SKIP LOCKED semantics - each job is processed by exactly one replica.
- Scheduled maintenance (expiry sweeps, vault checkpoints - which include anchor uploads - and recovery sweeps) is cluster-singleton: pg-boss claims each cron tick atomically in the database, so one replica fires it regardless of replica count. We tested this directly - three concurrent workers, one schedule, exactly one job per tick.
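The cluster-singleton property comes from letting the database arbitrate each tick. The sketch below illustrates that idea only; it is not how pg-boss is implemented. It uses SQLite in place of Postgres, and the `tick_claims` table, the `try_claim` helper, and the `expiry-sweep` schedule name are hypothetical. Three worker threads race to claim five ticks, and a primary key on (schedule, tick) lets exactly one insert per tick succeed.

```python
import sqlite3
import threading

# Hypothetical claims table: the primary key is the arbiter, so the
# database accepts exactly one row per (schedule, tick).
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute(
    "CREATE TABLE tick_claims ("
    " schedule TEXT, tick INTEGER, worker TEXT,"
    " PRIMARY KEY (schedule, tick))"
)
# This lock only serializes use of the single shared SQLite connection;
# the uniqueness decision is still made by the primary key.
db_lock = threading.Lock()


def try_claim(worker, schedule, tick):
    """Return True only for the worker whose insert created the row."""
    with db_lock:
        cur = db.execute(
            "INSERT OR IGNORE INTO tick_claims VALUES (?, ?, ?)",
            (schedule, tick, worker),
        )
        db.commit()
        return cur.rowcount == 1


fired = []  # (worker, tick) pairs that actually ran the job


def worker_loop(worker, ticks):
    for tick in ticks:
        if try_claim(worker, "expiry-sweep", tick):
            fired.append((worker, tick))


threads = [
    threading.Thread(target=worker_loop, args=(f"worker-{i}", range(5)))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

jobs_per_tick = {tick: sum(1 for _, t in fired if t == tick) for tick in range(5)}
print(jobs_per_tick)  # {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}
```

Which worker wins a given tick varies from run to run; the count per tick does not.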

A single worker replica is fully functional and is the right default; add replicas for queue throughput or for zero-gap coverage across pod restarts. The chart ships a worker HPA and PDB.

Database tier

PostgreSQL is the single stateful component, so the availability of your deployment is the availability of your database. Use a managed failover-capable Postgres: Aurora, RDS Multi-AZ, or an equivalent (Patroni/CloudNativePG for self-managed clusters; any failover-capable PostgreSQL 17+ works - nothing here is AWS-specific).

During a failover, in-flight writes fail and clients receive errors until the new primary is up; the application's connection pool recovers on its own as connections are checked out. Chain integrity is unaffected by design - an append either committed as one transaction or it did not. There is no partial chain state to repair, and verification after a failover requires nothing beyond the normal audit runbook.
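Why an interrupted append leaves nothing to repair can be shown with a small model. This is a sketch under stated assumptions, not the AGLedger schema or hash construction: the `chain` table, the `entry_hash` function, and the simulated `ConnectionError` are hypothetical, SQLite stands in for Postgres, and the per-record advisory lock is not modeled. The point is only that an append wrapped in one transaction either lands whole or leaves the chain untouched.

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE chain ("
    " seq INTEGER PRIMARY KEY, payload TEXT, prev_hash TEXT, hash TEXT)"
)

GENESIS = "0" * 64  # hypothetical anchor value for the first entry


def entry_hash(prev_hash, payload):
    """Hypothetical link hash: SHA-256 over previous hash plus payload."""
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


def append(payload, crash=False):
    """Append one entry inside a single transaction."""
    with db:  # commits on success, rolls back if the body raises
        row = db.execute(
            "SELECT hash FROM chain ORDER BY seq DESC LIMIT 1"
        ).fetchone()
        prev = row[0] if row else GENESIS
        db.execute(
            "INSERT INTO chain (payload, prev_hash, hash) VALUES (?, ?, ?)",
            (payload, prev, entry_hash(prev, payload)),
        )
        if crash:
            # Simulates losing the primary after the write but before commit.
            raise ConnectionError("primary lost before commit")


def verify():
    """Walk the chain and check every link against its predecessor."""
    prev = GENESIS
    rows = db.execute("SELECT payload, prev_hash, hash FROM chain ORDER BY seq")
    for payload, prev_hash, stored in rows:
        if prev_hash != prev or stored != entry_hash(prev, payload):
            return False
        prev = stored
    return True


append("record created")
try:
    append("record updated", crash=True)  # fails mid-transaction
except ConnectionError:
    pass
append("record updated")  # client retry once the database is back

count = db.execute("SELECT COUNT(*) FROM chain").fetchone()[0]
print(count, verify())  # 2 True
```

The failed append leaves no row behind, so the retry links to the last committed entry and verification passes without any repair step.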

The bundled Postgres (the Compose overlay docker-compose.postgres.yml, or the chart's postgres.bundled deployment) is a single-node convenience, explicitly not an HA database. Running on an external database is the Enterprise license boundary. The Developer Edition license covers production use on the bundled single-node Postgres; pointing the Server at Aurora or any other external Postgres is Enterprise scope, which is why high availability is an Enterprise-tier line item. The boundary is contractual, not technical: the Server detects the external-database case and logs a persistent licensing notice, but never blocks or degrades.

External anchoring is unaffected by all of this: anchors are written to object storage outside the database and remain valid ground truth across any failover.

The supported HA shape, summarized

- API: replicaCount: 2+, RollingUpdate, RATE_LIMIT_STORE=postgresql, anti-affinity across nodes.
- Worker: replicaCount: 1 default, 2+ for throughput - safe either way.
- PostgreSQL: managed failover (Enterprise license), TLS required in production.
- Anchoring enabled, to a separately-administered bucket.

What is not supported today: multi-region active-active against one logical chain, and any deployment shape where two Servers write the same database schema independently. One Server, one database; scale the stateless tiers; let the database layer own failover. For multi-site topologies, run sovereign Servers per site and connect them with federation.

Validated against API v1.1.0 on 2026-06-10. Worker-scaling claim run-tested: three concurrent worker instances against one database produced exactly one scheduled job per cron tick.

Reviewed for API v1.3.3 on 2026-07-20: 1.3.3 (audit-vault chain-scan detection plus opt-in verdict per-actor signatures) changes no deployment topology; the one-Server-one-database model and worker-scaling behavior are unchanged.