Day-2 Operations
This page covers the recurring work after install: watching health, scraping metrics, rotating the signing key, keeping partitions ahead of growth, reloading config, and upgrading. Two facts shape everything below.
- Two processes. A Server runs an API process and a worker process. The API serves requests; the worker runs scheduled maintenance (partition upkeep, recovery sweeps) and exposes its own metrics. Some gauges live only on the worker — noted where it matters.
- Two key roles. Org-scoped admin keys govern one org. Cross-org operator surfaces (system-health, signing-key rotation, provisioning) require a platform key. An admin key calling these gets a 403 that says so:
"detail": "Action 'ROTATE_VAULT_SIGNING_KEY' requires platform role; caller resolved as 'admin'.",
"recoveryHint": "This action is platform-scoped ... mint a platform-role key via POST /v1/admin/api-keys (platform-only)."
Health and readiness probes
Three unauthenticated endpoints, shaped for orchestration probes:
curl -s "$AGLEDGER_URL/livez" # liveness: is the process up?
curl -s "$AGLEDGER_URL/readyz" # readiness: can it serve (DB reachable)?
curl -s "$AGLEDGER_URL/health" # detailed status + version
{"status":"alive","timestamp":"2026-05-26T01:21:29.989Z"}
{"status":"ready","version":"0.25.4","timestamp":"2026-05-26T01:21:29.996Z"}
{"status":"ok","version":"0.25.4","timestamp":"2026-05-26T01:21:29.977Z"}
Wire /livez to your liveness probe and /readyz to your readiness probe. /health/ready is an
alias of /readyz. None require auth, so probes need no credentials.
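For deploy scripts that must block until the Server can take traffic, the readiness probe can be polled. The following is a minimal sketch, not part of the product: the function names, the retry counts, and the choice to treat any connection error or non-JSON body as "not ready yet" are all assumptions. Only the /readyz path and the "ready" status value come from the responses shown above.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_json(url: str, timeout: float = 2.0) -> dict:
    """GET a URL and decode the JSON body (no auth header; the probes need none)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_until_ready(base_url: str,
                     fetch: Callable[[str], dict] = fetch_json,
                     attempts: int = 30,
                     delay_s: float = 1.0) -> bool:
    """Poll /readyz until it reports status "ready" or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch(f"{base_url}/readyz").get("status") == "ready":
                return True
        except (OSError, ValueError):
            # Assumption: refused connections, timeouts, HTTP errors, and
            # non-JSON bodies all mean "not ready yet", so keep polling.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

The `fetch` parameter is injectable so the loop can be exercised without a running Server.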
The aggregate health view
GET /v1/admin/system-health (platform key) is the one-call operator summary: database latency and
pool, every pg-boss queue, and process memory.
curl -s -H "Authorization: Bearer $AGLEDGER_PLATFORM_KEY" "$AGLEDGER_URL/v1/admin/system-health"
{
"status": "healthy",
"uptime": 810.18,
"database": { "status": "healthy", "latencyMs": 6.43, "pool": { "total": 2, "idle": 2, "waiting": 0 } },
"queues": {
"phase2-verification": { "waiting": 0, "active": 0, "delayed": 0, "failed": 0 },
"webhook-delivery": { "waiting": 0, "active": 0, "delayed": 0, "failed": 0 },
"maintenance": { "waiting": 0, "active": 0, "delayed": 0, "failed": 0 },
"federation-outbound": { "waiting": 0, "active": 0, "delayed": 0, "failed": 0 },
"federation-outbound-dlq": { "waiting": 0, "active": 0, "delayed": 0, "failed": 0 }
},
"process": { "rssMb": 168.96, "heapUsedMb": 71.96, "heapTotalMb": 76.66 }
}
A growing queues.*.failed count or a climbing database.pool.waiting is your earliest signal of trouble.
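Because "growing" is a statement about change, a single response is not enough; two snapshots have to be compared. A minimal sketch of that comparison follows, assuming both arguments are parsed system-health responses with the shape shown above. The function names are hypothetical.

```python
def queue_failed_growth(previous: dict, current: dict) -> dict[str, int]:
    """Return {queue name: increase} for queues whose failed count rose."""
    growth = {}
    before_queues = previous.get("queues", {})
    for name, counts in current.get("queues", {}).items():
        before = before_queues.get(name, {}).get("failed", 0)
        delta = counts.get("failed", 0) - before
        if delta > 0:
            growth[name] = delta
    return growth


def pool_waiting(snapshot: dict) -> int:
    """Number of clients waiting for a database connection in this snapshot."""
    return snapshot.get("database", {}).get("pool", {}).get("waiting", 0)
```

A scheduled job could keep the last response on disk, call `queue_failed_growth(last, now)`, and page when the result is non-empty or `pool_waiting(now)` stays above zero across several runs.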
Metrics
GET /metrics exposes Prometheus metrics (unauthenticated; restrict at your ingress). All series
are prefixed agledger_. The ones worth alerting on:
| Metric | Watch for |
|---|---|
| agledger_vault_integrity_check_results_total{result="broken"} | Any increase — a chain failed periodic verification |
| agledger_db_pool_waiting_connections | Sustained nonzero — pool saturation |
| agledger_pgboss_queue_size{queue=~".*-dlq",state="total"} | Growth — jobs dead-lettering into a DLQ |
| agledger_pg_listener_reconnect_failures_total | Increase — cross-replica cache coherence degraded |
| agledger_vault_checkpoint_skipped_broken_total | Increase — a record went un-anchored |
| agledger_outbound_ssrf_blocked_total | Increase — outbound calls hitting the SSRF guard |
curl -s "$AGLEDGER_URL/metrics" | grep agledger_vault_integrity_check_results_total
Note: agledger_partition_runway_days (see Partition maintenance) is exposed by the worker process, not
the API process. Scrape both processes, not just the API.
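Where a full Prometheus stack is not available, the /metrics text can be checked directly. The sketch below reads the Prometheus text exposition format and sums the samples of one metric that match the given labels. It is a simplified reader written for illustration (it is not a complete parser of the format), and `sum_samples` is a hypothetical name.

```python
import re

# metric name, optional {label set}, then the sample value
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def sum_samples(text: str, metric: str, **labels: str) -> float:
    """Sum all samples of `metric` whose labels include every given pair."""
    total = 0.0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if not match or match.group(1) != metric:
            continue
        found = dict(_LABEL.findall(match.group(2) or ""))
        if all(found.get(key) == value for key, value in labels.items()):
            total += float(match.group(3))
    return total
```

For example, `sum_samples(body, "agledger_vault_integrity_check_results_total", result="broken")` gives the current counter value. Counters only ever rise, so an alert on "any increase" needs the value from the previous scrape to compare against.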
Signing-key rotation
Rotation is the load-bearing day-2 task. The guarantee that makes it safe:
Rotating the signing key never breaks verification of already-signed records. Retired keys stay in the published registry, so a record signed under an old key still verifies after any number of rotations. No re-signing, no downtime.
How rotation works
The engine signs with the key in VAULT_SIGNING_KEY. To rotate: generate a new key, set it as
VAULT_SIGNING_KEY, move the prior key to VAULT_SIGNING_KEY_PREVIOUS, and restart the process.
On boot the engine retires the old active key in the registry and promotes the new one:
INFO: Retired previous active signing key during bootstrap
INFO: Bootstrapped active signing key into registry
POST /v1/admin/vault/signing-keys/rotate (platform key) reconciles the registry to the
env-configured key. In a normally-booted process the boot step has already promoted it, so the
endpoint reports already_active — use it to confirm, not to mint:
{ "previousKeyId": null, "newKeyId": "6a639248683aab56", "status": "already_active" }
After rotation, both keys appear at GET /v1/verification-keys — the new one active, the prior one
retired but still resolvable:
curl -s "$AGLEDGER_URL/v1/verification-keys"
6a639248683aab56 | active | activated 2026-05-26 | retired null
affc2b9bfb22144e | retired | activated 2026-05-26 | retired 2026-05-26
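The post-rotation state can be asserted rather than eyeballed. The sketch below works on the pipe-separated rows exactly as printed above (key id first, status second); how those rows are derived from the raw /v1/verification-keys response is not shown on this page, so that step is left to the caller. Both function names are hypothetical.

```python
def parse_key_rows(text: str) -> list[tuple[str, str]]:
    """Turn 'id | status | ...' lines into (key_id, status) pairs."""
    rows = []
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) >= 2 and parts[0]:
            rows.append((parts[0], parts[1]))
    return rows


def rotation_problems(rows: list[tuple[str, str]],
                      new_key_id: str, old_key_id: str) -> list[str]:
    """Return a list of problems; an empty list means the rotation looks right."""
    problems = []
    status_by_id = dict(rows)
    active = [key_id for key_id, status in rows if status == "active"]
    if active != [new_key_id]:
        problems.append(f"expected only {new_key_id} active, found {active}")
    if status_by_id.get(old_key_id) != "retired":
        problems.append(
            f"{old_key_id} should be retired but still listed, "
            f"found {status_by_id.get(old_key_id)!r}"
        )
    return problems
```

The second check is the important one: the old key must still be present in the registry, because that is what keeps older records verifiable.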
Proving the guarantee
Records signed before the rotation must still verify. Dump the vault and run the offline verifier (see the audit runbook) — the dump now carries two signing keys and entries signed by both:
DATABASE_URL=postgresql://… pnpm vault:dump ./dump # "vault_signing_keys": 2
agledger-verify ./dump
[PASS] AGLedger offline verification
audit_vault chain
records : 5
entries : 5
failures : 0
Zero failures across records signed by the retired key and the active key. That is the guarantee, demonstrated end to end.
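To gate a rotation runbook on this result, the verifier's report can be checked by a script. The sketch below relies only on the two markers visible in the sample output above, the [PASS] banner and the "failures : N" line. Whether agledger-verify also signals failure through its exit code is not stated here, so this sketch does not depend on it, and it assumes the report layout stays as shown.

```python
import re


def verification_passed(output: str) -> bool:
    """True only if the report shows a [PASS] banner and zero failures."""
    has_pass = any(
        line.strip().startswith("[PASS]") for line in output.splitlines()
    )
    failures = re.search(r"^\s*failures\s*:\s*(\d+)\s*$", output, re.MULTILINE)
    # A missing failures line is treated as a failed check, not a pass.
    return has_pass and failures is not None and int(failures.group(1)) == 0
```

Requiring both markers is deliberate: a truncated or reformatted report fails closed instead of being read as success.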
Partition maintenance
Several high-volume tables are range-partitioned by month, each with a DEFAULT catch-all partition
so a write never fails for lack of a partition. The worker pre-creates upcoming partitions and
exposes runway as a gauge (agledger_partition_runway_days). You can also query the source
function directly:
psql "$DATABASE_URL" -c "SELECT table_name, runway_days, default_rows FROM partition_runway();"
table_name | runway_days | default_rows
--------------------+-------------+--------------
audit_vault | 585 | 0
events | 585 | 0
webhook_deliveries | 585 | 0
system_audit_log | 98 | 0
runway_days is the number of days until writes reach the latest pre-created partition. default_rows
should stay 0: a nonzero value means writes are landing in the DEFAULT partition and the worker is
falling behind. Alert on low runway_days and on default_rows > 0.
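Both alert conditions are simple to evaluate from the query result. The sketch below assumes the query above was run with psql's unaligned, tuples-only flags (`-At`), which print one `table|runway_days|default_rows` line per row. The 120-day threshold is an arbitrary example rather than a recommended value, and the function names are hypothetical.

```python
def parse_runway_rows(psql_output: str) -> list[tuple[str, int, int]]:
    """Parse 'table|runway_days|default_rows' lines from `psql -At`."""
    rows = []
    for line in psql_output.splitlines():
        if not line.strip():
            continue
        table, runway_days, default_rows = line.split("|")
        rows.append((table.strip(), int(runway_days), int(default_rows)))
    return rows


def partition_alerts(rows: list[tuple[str, int, int]],
                     min_runway_days: int = 120) -> list[str]:
    """Return one message per table that is low on runway or using DEFAULT."""
    alerts = []
    for table, runway_days, default_rows in rows:
        if runway_days < min_runway_days:
            alerts.append(
                f"{table}: runway {runway_days}d is below {min_runway_days}d"
            )
        if default_rows > 0:
            alerts.append(f"{table}: {default_rows} row(s) in DEFAULT partition")
    return alerts
```

With the sample rows above and this example threshold, only system_audit_log (98 days) would be flagged.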
Config-as-code hot reload
If you run with PROVISIONING_CONFIG_PATH set, orgs, agents, webhooks, and contract schemas are
declared in YAML and reconciled on every boot. Reload changes without a restart via SIGHUP or
POST /v1/admin/provisioning/reload (platform key). Check current state first:
curl -s -H "Authorization: Bearer $AGLEDGER_PLATFORM_KEY" "$AGLEDGER_URL/v1/admin/provisioning/status"
{"configured":true,"configPath":"/etc/agledger/provisioning","dryRun":false,"prune":false,"lastReloadAt":"2026-05-26T01:26:43.834Z","managed":{"orgs":1,"agents":2,"webhooks":0,"schemas":2}}
curl -s -X POST -H "Authorization: Bearer $AGLEDGER_PLATFORM_KEY" "$AGLEDGER_URL/v1/admin/provisioning/reload"
{
"orgs": { "created": 0, "updated": 1, "pruned": 0 },
"agents": { "created": 0, "updated": 2, "pruned": 0 },
"schemas": { "created": 0, "updated": 2, "pruned": 0 },
"apiKeys": { "created": 0, "skipped": 3, "generated": [] },
"errors": [
{ "resource": "config", "error": "webhooks/example.yaml: Environment variable ACME_WEBHOOK_SECRET is not set and has no default" }
]
}
Reload is idempotent: unchanged resources count as updated, existing keys as skipped. It is also
fail-open — a single invalid file (here, an unset ${ACME_WEBHOOK_SECRET} substitution) is reported
in errors[] while every valid resource still applies. Newly minted keys appear in
apiKeys.generated[] with their raw value exactly once, in that response body — capture them then.
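Since a reload can apply most resources and still report problems, automation should read the response body rather than treat the call as all-or-nothing. The sketch below pulls out the two parts that need action: the errors[] entries and any generated keys. It assumes the response shape shown above and makes no assumption about the fields inside each generated entry; `reload_follow_ups` is a hypothetical name.

```python
def reload_follow_ups(result: dict) -> tuple[list[str], list]:
    """Split a reload response into (error messages, generated key entries)."""
    errors = [
        f"{entry.get('resource')}: {entry.get('error')}"
        for entry in result.get("errors", [])
    ]
    # Raw key values appear only in this response, so hand the entries back
    # to the caller untouched for storage in a secret manager.
    generated = list(result.get("apiKeys", {}).get("generated", []))
    return errors, generated
```

A caller would store `generated` first, then fail the pipeline step if `errors` is non-empty, so a config problem never costs a freshly minted key.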
Version upgrades
Upgrade with deploy/scripts/upgrade.sh (see the install runbook for the full
procedure and the air-gapped path). Migrations run automatically and are advisory-locked and
checksum-verified, so a migration runs once and only once even across replicas, and a tampered or
reordered migration is refused rather than applied. For diagnostics to attach to a support request,
deploy/scripts/support-bundle.sh collects health, config (secrets redacted), and recent
operational events.
Restart the process after swapping the image. Already-signed records verify across versions — the chain format is stable and historical keys resolve.
Validated against API v0.25.4 on 2026-05-25.