High availability

An AGLedger deployment has three tiers with three different availability stories:

| Tier | State | HA approach |
|---|---|---|
| API | Stateless | Scale replicas; one env footgun (RATE_LIMIT_STORE) |
| Worker | Stateless (queue in Postgres) | Scale replicas; no special handling |
| PostgreSQL | All of it | Managed failover (Aurora, RDS Multi-AZ, Patroni) |

The chain lives entirely in PostgreSQL. Application pods hold no durable state — every tamper-evidence guarantee survives any pod being killed at any moment, because chain appends are single transactions serialized per record by a database advisory lock. There is no HA mode to turn on in the application; there is a deployment shape to configure.
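To make the serialization argument concrete, here is a toy in-process model, not the Server's implementation. It assumes a hash-linked chain (this page does not describe the chain format) and uses a `threading.Lock` per record as a stand-in for the database advisory lock; `append`, `verify`, and the record and event names are all hypothetical.

```python
import hashlib
import threading
from collections import defaultdict

chains = defaultdict(list)                  # record_id -> list of (payload, hash)
record_locks = defaultdict(threading.Lock)  # stand-in for a per-record advisory lock
registry_lock = threading.Lock()            # guards creation of per-record locks


def append(record_id: str, payload: str) -> None:
    """Append one entry; reading the previous hash and writing are one unit."""
    with registry_lock:
        lock = record_locks[record_id]
    with lock:
        chain = chains[record_id]
        prev_hash = chain[-1][1] if chain else ""
        entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        chain.append((payload, entry_hash))


def verify(record_id: str) -> bool:
    """Recompute every link and compare it with the stored hash."""
    prev_hash = ""
    for payload, entry_hash in chains[record_id]:
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if expected != entry_hash:
            return False
        prev_hash = entry_hash
    return True


# Fifty concurrent writers against one record, as competing pods would be.
threads = [
    threading.Thread(target=append, args=("record-1", f"event-{i}"))
    for i in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(chains["record-1"]), verify("record-1"))
```

Because each append holds the record's lock across both the read and the write, no two writers can link to the same predecessor, whichever order they arrive in. A writer that dies before its append completes leaves the chain exactly as it was.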

API tier

API pods are stateless and scale horizontally. Three changes from the single-node default, all in the chart's values:

api:
  replicaCount: 2          # or enable the HPA
  strategy:
    type: RollingUpdate    # default is Recreate, sized for single-node clusters

extraEnv:                  # required when running more than one API replica
  - name: RATE_LIMIT_STORE
    value: postgresql

The default rate-limit store is in-process memory, which is correct for one replica and silently wrong for N — each replica enforces its own counters, multiplying every effective limit by N. The PostgreSQL store gives cluster-wide limits. (If the store cannot reach a database pool it falls back to in-memory rather than failing requests, so set it and confirm rather than assume.)
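The multiplication effect is easy to reproduce in a toy model. The sketch below is not the Server's rate limiter: `FixedWindowLimiter` and `accepted_requests` are hypothetical names, the window never resets, and requests are assumed to be spread round-robin across replicas.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class FixedWindowLimiter:
    """Toy counter that stands in for one rate-limit store."""
    limit: int
    counts: dict = field(default_factory=dict)

    def allow(self, client: str) -> bool:
        used = self.counts.get(client, 0)
        if used >= self.limit:
            return False
        self.counts[client] = used + 1
        return True


def accepted_requests(limiters, attempts: int, client: str = "client-a") -> int:
    """Send `attempts` requests round-robin across replicas; count how many pass."""
    replicas = cycle(limiters)
    return sum(next(replicas).allow(client) for _ in range(attempts))


LIMIT, REPLICAS, ATTEMPTS = 10, 3, 100

# In-memory store: every replica owns a private counter.
per_replica = [FixedWindowLimiter(LIMIT) for _ in range(REPLICAS)]
in_memory_total = accepted_requests(per_replica, ATTEMPTS)

# Shared store: every replica consults the same counter.
shared = FixedWindowLimiter(LIMIT)
shared_total = accepted_requests([shared] * REPLICAS, ATTEMPTS)

print(in_memory_total, shared_total)  # 30 10
```

With three replicas and a limit of 10, the private counters admit 30 requests while the shared counter admits 10, which is the gap the PostgreSQL store closes.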

Everything else is already multi-replica-safe: signing-key and discovery caches invalidate across replicas via Postgres LISTEN/NOTIFY, and migrations cannot race. In the chart, migrations run as a pre-install/pre-upgrade Job — one runner per release, not one per pod. The optional startup runner (RUN_MIGRATIONS, off by default) takes a global advisory lock and verifies migration checksums; a pod that loses the lock exits and restarts rather than running migrations twice.

The chart ships an HPA, a PodDisruptionBudget, and a commented pod-anti-affinity example (spread replicas across nodes) for the API deployment — see values.yaml.

Worker tier

Worker replicas scale the same way (worker.replicaCount), with no special configuration.

A single worker replica is fully functional and is the right default; add replicas for queue throughput or for zero-gap coverage across pod restarts. The chart ships a worker HPA and PDB.

Database tier

PostgreSQL is the single stateful component, so the availability of your deployment is the availability of your database. Use a managed failover-capable Postgres: Aurora, RDS Multi-AZ, or an equivalent (Patroni/CloudNativePG for self-managed clusters; any failover-capable PostgreSQL 17+ works — nothing here is AWS-specific).

During a failover, in-flight writes fail and clients receive errors until the new primary is up; the application's connection pool recovers on its own as connections are checked out. Chain integrity is unaffected by design — an append either committed as one transaction or it did not. There is no partial chain state to repair, and verification after a failover requires nothing beyond the normal audit runbook.
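On the client side, one common way to ride out the error window is to retry with exponential backoff. The sketch below is a generic pattern, not part of any AGLedger SDK: `TransientDatabaseError`, `call_with_retry`, and `flaky_write` are hypothetical names. It assumes the operation is safe to repeat, since this page does not say whether the API offers idempotency guarantees for retried writes.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical stand-in for whatever error a client sees mid-failover."""


def call_with_retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientDatabaseError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Usage: simulate a primary that is unavailable for the first two calls.
calls = {"n": 0}


def flaky_write():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TransientDatabaseError("primary unavailable")
    return "committed"


delays = []  # record the backoff delays instead of actually sleeping
result = call_with_retry(flaky_write, sleep=delays.append)
print(result, delays)  # committed [0.5, 1.0]
```

Size the total retry budget to the failover time you observe for your database. The default of five attempts with a 0.5 second base delay is an arbitrary illustration.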

The bundled Postgres (the Compose overlay docker-compose.postgres.yml, or the chart's postgres.bundled deployment) is a single-node convenience, explicitly not an HA database. Running on an external database is the Enterprise license boundary: the Developer Edition license covers production use on the bundled single-node Postgres, and pointing the Server at Aurora or any other external Postgres is Enterprise scope — which is why high availability is an Enterprise-tier line item. The boundary is contractual, not technical: the Server detects the external-database case and logs a persistent licensing notice, but never blocks or degrades.

External anchoring is unaffected by all of this: anchors are written to object storage outside the database and remain valid ground truth across any failover.

The supported HA shape, summarized

What is not supported today: multi-region active-active against one logical chain, and any deployment shape where two Servers write the same database schema independently. One Server, one database; scale the stateless tiers; let the database layer own failover. For multi-site topologies, run sovereign Servers per site and connect them with federation.


Validated against API v1.0.0 on 2026-06-10. Worker-scaling claim run-tested: three concurrent worker instances against one database produced exactly one scheduled job per cron tick.