Day-2 Operations

This page covers the recurring work after install: watching health, scraping metrics, rotating the signing key, keeping partitions ahead of growth, reloading config, and upgrading. Two facts shape everything below.

Platform-scoped actions reject any key that lacks the platform role. The refusal reads, in part:

"detail": "Action 'ROTATE_VAULT_SIGNING_KEY' requires platform role; caller resolved as 'admin'.",
"recoveryHint": "This action is platform-scoped ... mint a platform-role key via POST /v1/admin/api-keys (platform-only)."

Health and readiness probes

Three unauthenticated endpoints, shaped for orchestration probes:

curl -s "$AGLEDGER_URL/livez"          # liveness: is the process up?
curl -s "$AGLEDGER_URL/readyz"         # readiness: can it serve (DB reachable)?
curl -s "$AGLEDGER_URL/health"         # detailed status + version
{"status":"alive","timestamp":"2026-05-26T01:21:29.989Z"}
{"status":"ready","version":"0.25.4","timestamp":"2026-05-26T01:21:29.996Z"}
{"status":"ok","version":"0.25.4","timestamp":"2026-05-26T01:21:29.977Z"}

Wire livez to your liveness probe and readyz to your readiness probe. /health/ready is an alias of readyz. None require auth, so probes need no credentials.

The aggregate health view

GET /v1/admin/system-health (platform key) is the one-call operator summary: database latency and pool, every pg-boss queue, and process memory.

curl -s -H "Authorization: Bearer $AGLEDGER_PLATFORM_KEY" "$AGLEDGER_URL/v1/admin/system-health"
{
  "status": "healthy",
  "uptime": 810.18,
  "database": { "status": "healthy", "latencyMs": 6.43, "pool": { "total": 2, "idle": 2, "waiting": 0 } },
  "queues": {
    "phase2-verification": { "waiting": 0, "active": 0, "delayed": 0, "failed": 0 },
    "webhook-delivery":     { "waiting": 0, "active": 0, "delayed": 0, "failed": 0 },
    "maintenance":          { "waiting": 0, "active": 0, "delayed": 0, "failed": 0 },
    "federation-outbound":  { "waiting": 0, "active": 0, "delayed": 0, "failed": 0 },
    "federation-outbound-dlq": { "waiting": 0, "active": 0, "delayed": 0, "failed": 0 }
  },
  "process": { "rssMb": 168.96, "heapUsedMb": 71.96, "heapTotalMb": 76.66 }
}

A growing queues.*.failed count or a climbing pool.waiting is your earliest signal of trouble.
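
Because the signal is a trend rather than a snapshot, one way to act on it is to compare two consecutive system-health responses. The following is a minimal sketch, not part of AGLedger: the function name `health_regressions` is hypothetical, and it assumes only the payload fields shown in the response above.

```python
from typing import Any


def health_regressions(previous: dict[str, Any], current: dict[str, Any]) -> list[str]:
    """Compare two system-health payloads and list what got worse."""
    findings: list[str] = []

    # pool.waiting counts callers queued for a database connection.
    prev_waiting = previous["database"]["pool"]["waiting"]
    cur_waiting = current["database"]["pool"]["waiting"]
    if cur_waiting > prev_waiting:
        findings.append(f"pool.waiting rose {prev_waiting} -> {cur_waiting}")

    # A queue missing from the previous sample is treated as starting at 0.
    for name in sorted(current["queues"]):
        prev_failed = previous["queues"].get(name, {}).get("failed", 0)
        cur_failed = current["queues"][name]["failed"]
        if cur_failed > prev_failed:
            findings.append(f"{name}.failed rose {prev_failed} -> {cur_failed}")

    return findings
```

Feed it the parsed JSON from two polls taken some interval apart; an empty list means neither signal moved in the wrong direction between them.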

Metrics

GET /metrics exposes Prometheus metrics (unauthenticated; restrict at your ingress). All series are prefixed agledger_. The ones worth alerting on:

| Metric | Watch for |
|---|---|
| agledger_vault_integrity_check_results_total{result="broken"} | Any increase: a chain failed periodic verification |
| agledger_db_pool_waiting_connections | Sustained nonzero: pool saturation |
| agledger_pgboss_queue_size{queue=~".*-dlq",state="total"} | Growth: jobs dead-lettering into a DLQ |
| agledger_pg_listener_reconnect_failures_total | Increase: cross-replica cache coherence degraded |
| agledger_vault_checkpoint_skipped_broken_total | Increase: a record went un-anchored |
| agledger_outbound_ssrf_blocked_total | Increase: outbound calls hitting the SSRF guard |

curl -s "$AGLEDGER_URL/metrics" | grep agledger_vault_integrity_check_results_total

Note: agledger_partition_runway_days (next section) is exposed by the worker process, not the API process. Scrape both processes, not just the API.
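
For ad-hoc checks without a Prometheus server, the scraped text can be read directly. The sketch below is a deliberately simplified reader of the Prometheus text exposition format, not a full parser: the helper name `sum_series` is hypothetical, label matching is a plain substring test, and the sample values are illustrative rather than real output.

```python
def sum_series(exposition: str, metric: str, label_filter: str = "") -> float:
    """Sum the sample values of one metric in Prometheus text format."""
    total = 0.0
    for raw in exposition.splitlines():
        line = raw.strip()
        # Skip blanks and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The series name ends at "{" (labels) or at the first whitespace.
        name = line.split("{", 1)[0].split(None, 1)[0]
        if name != metric:
            continue
        if label_filter and label_filter not in line:
            continue
        # The value is the first token after the label set (or after the name);
        # an optional trailing timestamp is ignored.
        rest = line[line.rindex("}") + 1:] if "}" in line else line[len(name):]
        total += float(rest.split()[0])
    return total


# Illustrative sample only; real values come from GET /metrics.
SAMPLE = """\
# TYPE agledger_vault_integrity_check_results_total counter
agledger_vault_integrity_check_results_total{result="broken"} 0
agledger_db_pool_waiting_connections 3
"""

broken = sum_series(SAMPLE, "agledger_vault_integrity_check_results_total", 'result="broken"')
waiting = sum_series(SAMPLE, "agledger_db_pool_waiting_connections")
```

Run it against the output of each process you scrape, since a series exposed only by the worker will be absent from the API's output.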

Signing-key rotation

Rotation is the load-bearing day-2 task. The guarantee that makes it safe:

Rotating the signing key never breaks verification of already-signed records. Retired keys stay in the published registry, so a record signed under an old key still verifies after any number of rotations. No re-signing, no downtime.

How rotation works

The engine signs with the key in VAULT_SIGNING_KEY. To rotate: generate a new key, set it as VAULT_SIGNING_KEY, move the prior key to VAULT_SIGNING_KEY_PREVIOUS, and restart the process. On boot the engine retires the old active key in the registry and promotes the new one:

INFO: Retired previous active signing key during bootstrap
INFO: Bootstrapped active signing key into registry

POST /v1/admin/vault/signing-keys/rotate (platform key) reconciles the registry to the env-configured key. In a normally-booted process the boot step has already promoted it, so the endpoint reports already_active — use it to confirm, not to mint:

{ "previousKeyId": null, "newKeyId": "6a639248683aab56", "status": "already_active" }
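
A post-restart check can assert on that response instead of eyeballing it. This is a minimal sketch under stated assumptions: `rotation_confirmed` is a hypothetical helper, `expected_key_id` is whatever key ID you expect to be active, and any status other than already_active is treated as unconfirmed because this page documents no other status values.

```python
import json


def rotation_confirmed(response_body: str, expected_key_id: str) -> bool:
    """True when the rotate endpoint reports the expected key as already active."""
    data = json.loads(response_body)
    # Both conditions must hold: the boot step promoted a key, and it is the
    # key the operator intended to promote.
    return data.get("status") == "already_active" and data.get("newKeyId") == expected_key_id


body = '{ "previousKeyId": null, "newKeyId": "6a639248683aab56", "status": "already_active" }'
confirmed = rotation_confirmed(body, "6a639248683aab56")
```

A False result is a prompt to inspect the response and the boot logs, not proof that rotation failed.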

After rotation, both keys appear at GET /v1/verification-keys — the new one active, the prior one retired but still resolvable:

curl -s "$AGLEDGER_URL/v1/verification-keys"
6a639248683aab56 | active  | activated 2026-05-26 | retired null
affc2b9bfb22144e | retired | activated 2026-05-26 | retired 2026-05-26

Proving the guarantee

Records signed before the rotation must still verify. Dump the vault and run the offline verifier (see the audit runbook) — the dump now carries two signing keys and entries signed by both:

DATABASE_URL=postgresql://… pnpm vault:dump ./dump   # "vault_signing_keys": 2
agledger-verify ./dump
[PASS] AGLedger offline verification

audit_vault chain
  records    : 5
  entries     : 5
  failures    : 0

Zero failures across records signed by the retired key and the active key. That is the guarantee, demonstrated end to end.

Partition maintenance

Several high-volume tables are range-partitioned by month, each with a DEFAULT catch-all partition so a write never fails for lack of a partition. The worker pre-creates upcoming partitions and exposes runway as a gauge (agledger_partition_runway_days). You can also query the source function directly:

psql "$DATABASE_URL" -c "SELECT table_name, runway_days, default_rows FROM partition_runway();"
     table_name     | runway_days | default_rows
--------------------+-------------+--------------
 audit_vault        |         585 |            0
 events             |         585 |            0
 webhook_deliveries |         585 |            0
 system_audit_log   |          98 |            0

runway_days is the number of days until the latest pre-created partition is reached. default_rows should stay 0: a nonzero value means writes are landing in the DEFAULT partition and the worker is falling behind. Alert on low runway_days and on default_rows > 0.
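
The two alert conditions can be sketched as a small check over the rows returned by partition_runway(). Everything here beyond the three column names is an assumption: the `Runway` type and `runway_alerts` function are hypothetical, and the 30-day default threshold is a placeholder to tune for your environment, not a documented recommendation.

```python
from typing import Iterable, NamedTuple


class Runway(NamedTuple):
    table_name: str
    runway_days: int
    default_rows: int


def runway_alerts(rows: Iterable[Runway], min_runway_days: int = 30) -> list[str]:
    """Flag tables that are short on runway or writing into DEFAULT."""
    alerts: list[str] = []
    for row in rows:
        # Low runway: the worker is not pre-creating partitions far enough ahead.
        if row.runway_days < min_runway_days:
            alerts.append(f"{row.table_name}: runway {row.runway_days}d below {min_runway_days}d")
        # Any row in the DEFAULT partition means a write missed its monthly partition.
        if row.default_rows > 0:
            alerts.append(f"{row.table_name}: {row.default_rows} rows in DEFAULT partition")
    return alerts


# Rows copied from the query output above.
rows = [
    Runway("audit_vault", 585, 0),
    Runway("events", 585, 0),
    Runway("webhook_deliveries", 585, 0),
    Runway("system_audit_log", 98, 0),
]
alerts = runway_alerts(rows)
```

With the sample rows and the placeholder threshold nothing fires; raising the threshold above 98 would flag system_audit_log first.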

Config-as-code hot reload

If you run with PROVISIONING_CONFIG_PATH set, orgs, agents, webhooks, and contract schemas are declared in YAML and reconciled on every boot. Reload changes without a restart via SIGHUP or POST /v1/admin/provisioning/reload (platform key). Check current state first:

curl -s -H "Authorization: Bearer $AGLEDGER_PLATFORM_KEY" "$AGLEDGER_URL/v1/admin/provisioning/status"
{"configured":true,"configPath":"/etc/agledger/provisioning","dryRun":false,"prune":false,"lastReloadAt":"2026-05-26T01:26:43.834Z","managed":{"orgs":1,"agents":2,"webhooks":0,"schemas":2}}
curl -s -X POST -H "Authorization: Bearer $AGLEDGER_PLATFORM_KEY" "$AGLEDGER_URL/v1/admin/provisioning/reload"
{
  "orgs":    { "created": 0, "updated": 1, "pruned": 0 },
  "agents":  { "created": 0, "updated": 2, "pruned": 0 },
  "schemas": { "created": 0, "updated": 2, "pruned": 0 },
  "apiKeys": { "created": 0, "skipped": 3, "generated": [] },
  "errors": [
    { "resource": "config", "error": "webhooks/example.yaml: Environment variable ACME_WEBHOOK_SECRET is not set and has no default" }
  ]
}

Reload is idempotent: unchanged resources count as updated, existing keys as skipped. It is also fail-open — a single invalid file (here, an unset ${ACME_WEBHOOK_SECRET} substitution) is reported in errors[] while every valid resource still applies. Newly minted keys appear in apiKeys.generated[] with their raw value exactly once, in that response body — capture them then.
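
Since a reload can partly succeed, automation that triggers it should inspect the body rather than rely on the HTTP status alone. The sketch below assumes only the response fields shown above; `reload_summary` is a hypothetical helper, and it deliberately reports only the number of generated keys so that raw key values never reach a log.

```python
from typing import Any


def reload_summary(result: dict[str, Any]) -> dict[str, Any]:
    """Condense a provisioning reload response into a pass/fail summary."""
    # Each error entry names the resource and the reason it was not applied.
    errors = [f"{e.get('resource')}: {e.get('error')}" for e in result.get("errors", [])]
    # Generated keys are shown once; count them here and store them elsewhere.
    generated = result.get("apiKeys", {}).get("generated", [])
    return {"ok": not errors, "errors": errors, "generated_count": len(generated)}


# Response copied from the reload example above.
response = {
    "orgs": {"created": 0, "updated": 1, "pruned": 0},
    "agents": {"created": 0, "updated": 2, "pruned": 0},
    "schemas": {"created": 0, "updated": 2, "pruned": 0},
    "apiKeys": {"created": 0, "skipped": 3, "generated": []},
    "errors": [
        {
            "resource": "config",
            "error": "webhooks/example.yaml: Environment variable ACME_WEBHOOK_SECRET is not set and has no default",
        }
    ],
}
summary = reload_summary(response)
```

A pipeline could fail the deploy step when ok is false while still treating the valid resources as applied, which matches the fail-open behavior described above.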

Version upgrades

Upgrade with deploy/scripts/upgrade.sh (see the install runbook for the full procedure and the air-gapped path). Migrations run automatically and are advisory-locked and checksum-verified, so a migration runs once and only once even across replicas, and a tampered or reordered migration is refused rather than applied. For diagnostics to attach to a support request, deploy/scripts/support-bundle.sh collects health, config (secrets redacted), and recent operational events.

Restart the process after swapping the image. Already-signed records verify across versions — the chain format is stable and historical keys resolve.


Validated against API v0.25.4 on 2026-05-25.