Self-Hosted Operations
Day-2 operations for self-hosted AGLedger: upgrades, backup/restore, rollback, secret rotation, observability, troubleshooting, and high availability.
This guide assumes AGLedger is installed and running. All commands use the scripts included in the self-hosted distribution. Paths are relative to the self-hosted repository root.
1. Upgrading
The upgrade.sh script automates the full upgrade workflow: backup, image pull, migration, restart, and verification.
Basic upgrade
./upgrade.sh 1.3.0
The script performs these steps in order:
- Detects the current running version (from .env, the VERSION file, or the Docker image tag)
- Creates a pre-upgrade backup (calls ./scripts/backup.sh)
- Authenticates with the container registry (ECR)
- Pulls the target version image
- Stops the worker to prevent job processing during migration
- Runs database migrations using the new image
- Updates AGLEDGER_VERSION in .env and the VERSION file
- Restarts all services
- Waits for the API to become healthy (up to 30 seconds)
- Runs preflight checks
- Verifies the new version via /health and /conformance
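The version-detection step can be sketched as follows. This is an illustrative reimplementation, not the script's exact code; it assumes the AGLEDGER_VERSION variable and VERSION file names used by this distribution.

```shell
# Detect the current version: prefer .env, fall back to the VERSION file.
detect_version() {
  local dir="$1"
  if grep -q '^AGLEDGER_VERSION=' "$dir/.env" 2>/dev/null; then
    grep '^AGLEDGER_VERSION=' "$dir/.env" | cut -d= -f2
  elif [ -f "$dir/VERSION" ]; then
    cat "$dir/VERSION"
  else
    echo "unknown"
  fi
}

# Demonstrate against a throwaway directory
tmp=$(mktemp -d)
echo 'AGLEDGER_VERSION=1.2.0' > "$tmp/.env"
detect_version "$tmp"   # prints 1.2.0
```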
Skipping the pre-upgrade backup
./upgrade.sh 1.3.0 --skip-backup
Not recommended for production. Use only if you have a separate backup mechanism (e.g., RDS automated snapshots).
Deprecated environment variables
The upgrade script automatically handles removed variables:
- AGLEDGER_LICENSE_MODE (removed in v0.15.0) — the server refuses to start if this is set. The upgrade script comments it out automatically. Licensing is now automatic when a license key is present.
- AGLEDGER_RELEASE_DATE — baked into the Docker image at build time. If a previous install wrote it into .env, the script comments it out so the image-baked value takes precedence.
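The comment-out behavior can be approximated like this. It is a sketch of the idea only (the marker text and sed invocation are illustrative, not the script's actual implementation):

```shell
# Comment out an active "VAR=..." line in a .env file, keeping a .bak copy.
comment_out_var() {
  local var="$1" env_file="$2"
  sed -i.bak "s/^${var}=/# removed-in-upgrade: ${var}=/" "$env_file"
}

envdir=$(mktemp -d)
printf 'AGLEDGER_LICENSE_MODE=manual\nLOG_LEVEL=info\n' > "$envdir/.env"
comment_out_var AGLEDGER_LICENSE_MODE "$envdir/.env"
cat "$envdir/.env"
```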
Verifying after upgrade
curl -s http://localhost:3001/health | jq .
curl -s http://localhost:3001/conformance | jq .version
docker compose -f docker-compose/docker-compose.yml ps
2. Backup and Restore
2.1 Creating a backup
./scripts/backup.sh
This creates a timestamped tarball containing a PostgreSQL custom-format dump (pg_dump -Fc). The default location is ./backup/backup-YYYY-MM-DD-HHMMSS.tar.gz.
Retention: By default, the script keeps the 7 most recent backups and deletes older ones.
# Keep last 14 backups
./scripts/backup.sh --keep 14
# Custom backup directory
BACKUP_DIR=/mnt/backups ./scripts/backup.sh
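The retention logic amounts to keeping the N newest archives and deleting the rest. A minimal sketch, relying on the fact that the timestamped filenames shown above sort lexicographically in chronological order (illustrative, not the script's exact code):

```shell
# Keep the $keep newest backup-*.tar.gz files in $dir, delete the rest.
prune_backups() {
  local dir="$1" keep="$2"
  ls -1 "$dir"/backup-*.tar.gz 2>/dev/null \
    | sort -r \
    | tail -n +$((keep + 1)) \
    | xargs -r rm -f
}

bdir=$(mktemp -d)
for d in 2026-03-10 2026-03-11 2026-03-12 2026-03-13; do
  touch "$bdir/backup-$d-120000.tar.gz"
done
prune_backups "$bdir" 2
ls "$bdir"   # only the two newest remain
```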
Bundled vs. external database: The script auto-detects your database mode from DATABASE_URL in .env. For bundled PostgreSQL, it uses docker compose exec postgres pg_dump. For external databases (Aurora, RDS, Cloud SQL), it calls pg_dump directly — ensure the PostgreSQL client tools are installed on the host.
2.2 Restoring from backup
./scripts/restore.sh backup/backup-2026-03-14-120000.tar.gz
The restore script:
- Prompts for confirmation (the restore replaces all data in the database)
- Extracts the backup tarball to a temporary directory
- Stops the API and worker services
- Terminates active database connections
- Drops and recreates the database
- Runs pg_restore --no-owner --no-acl from the dump
- Restarts all services with docker compose up -d --wait
- Runs preflight checks to verify the restore
For non-interactive use (CI, automation):
./scripts/restore.sh --non-interactive backup/backup-2026-03-14-120000.tar.gz
External database restore: For external databases, the restore script uses psql and pg_restore directly. The database user must have CREATEDB privilege (or the ability to drop and recreate the target database). If DATABASE_URL_MIGRATE is set, the script uses that connection (which typically has owner-role privileges for DDL).
2.3 Managed database backups
If you run AGLedger on a managed PostgreSQL service, use the provider's native backup tools alongside (or instead of) the script-based backups:
| Provider | Backup mechanism | Notes |
|----------|-----------------|-------|
| AWS RDS / Aurora | Automated snapshots + point-in-time recovery | Enable automated backups; set retention to at least 7 days |
| Google Cloud SQL | Automated backups + on-demand backups | Enable in the instance configuration |
| Azure Database for PostgreSQL | Automated backups (enabled by default) | Retention configurable 7-35 days |
Managed backups provide continuous protection with near-zero RPO. The script-based backups (pg_dump) are useful for cross-version migration and portable archives.
2.4 Post-restore verification
After restoring from any backup method, verify the instance:
# 1. Health check
curl -s http://localhost:3001/health | jq .
# 2. Vault chain integrity scan
curl -X POST http://localhost:3001/v1/admin/vault/scan \
-H "Authorization: Bearer $PLATFORM_KEY" \
-H "Content-Type: application/json" \
-d '{}'
# 3. If using YAML provisioning, reload to reconcile state
curl -X POST http://localhost:3001/v1/admin/provisioning/reload \
-H "Authorization: Bearer $PLATFORM_KEY" \
-H "Content-Type: application/json"
# 4. Run a smoke-test lifecycle (create mandate, submit receipt, verify settlement)
The vault scan walks every mandate's hash chain and reports any broken chains. After a clean restore, all chains should be intact up to the restore point — PostgreSQL's transactional consistency guarantees that partial vault entries cannot exist.
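The property the scan relies on can be illustrated with a toy chain: each entry's hash covers the previous entry's hash plus the new payload, so altering or deleting any entry changes every hash after it. This is a conceptual sketch only; the real vault entry format and hashing scheme are internal to AGLedger.

```shell
# Toy hash chain: hash(prev_hash + payload) links each entry to its predecessor.
chain_hash() {
  printf '%s%s' "$1" "$2" | sha256sum | cut -d' ' -f1
}

h0=$(chain_hash "genesis" "entry-1")
h1=$(chain_hash "$h0" "entry-2")
h2=$(chain_hash "$h1" "entry-3")

# Recomputing from scratch reproduces the same head hash when intact
r0=$(chain_hash "genesis" "entry-1")
r1=$(chain_hash "$r0" "entry-2")
r2=$(chain_hash "$r1" "entry-3")
[ "$h2" = "$r2" ] && echo "chain intact"

# Tampering with entry-2 changes every subsequent hash
t1=$(chain_hash "$r0" "entry-2-tampered")
t2=$(chain_hash "$t1" "entry-3")
[ "$h2" != "$t2" ] && echo "tamper detected"
```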
3. Rollback
The rollback.sh script restores the database from backup and reverts to a previous image version. Database migrations are forward-only, so rollback means restoring the full database state.
Auto-detected rollback
./scripts/rollback.sh
Without arguments, the script:
- Uses the most recent backup in ./backup/
- Reads the pre-upgrade version from ./backup/.pre-upgrade-version (written automatically by upgrade.sh)
- Prompts for confirmation
Explicit rollback
# Specify the target version
./scripts/rollback.sh --version 0.9.2
# Specify a specific backup file
./scripts/rollback.sh --backup backup/backup-2026-03-25-120000.tar.gz
# Non-interactive (for automation)
./scripts/rollback.sh --non-interactive --version 0.9.2
What rollback does
- Locates the backup (most recent, or specified with --backup)
- Resolves the target version (from .pre-upgrade-version, --version, or interactive prompt)
- Restores the database by calling ./scripts/restore.sh --non-interactive
- Updates AGLEDGER_VERSION in .env and the VERSION file
- Pulls the target version image (falls back to cached image if pull fails)
- Restarts all services
Data loss warning: All data written after the backup will be lost. This includes mandates, receipts, vault entries, and webhook deliveries created between the backup and the rollback.
4. Secret Rotation
Two secrets may need rotation: API_KEY_SECRET (used for HMAC hashing of API keys) and VAULT_SIGNING_KEY (used for Ed25519 signatures on audit vault entries).
Run the interactive rotation guide:
./scripts/rotate-secrets.sh # Prompts for which secret
./scripts/rotate-secrets.sh api-key-secret # Rotate API key secret directly
./scripts/rotate-secrets.sh vault-signing-key # Rotate vault signing key directly
4.1 API_KEY_SECRET rotation
This is a 6-step process with a dual-secret window to avoid downtime:
1. Save the current API_KEY_SECRET as API_KEY_SECRET_PREVIOUS in .env
2. Generate a new API_KEY_SECRET (64-character hex via openssl rand -hex 32)
3. Restart services — dual-secret mode activates (both old and new keys work)
4. Re-hash all API keys with the new secret (rehash-api-keys.js). This step is irreversible.
5. Restart services to verify
6. Remove API_KEY_SECRET_PREVIOUS from .env and restart (old hashes stop working)
Between steps 3 and 6, both the old and new secrets are active. This allows you to pause and verify that integrations still work before committing.
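The dual-secret window can be sketched as a lookup that tries the current secret first, then falls back to the previous one. This is illustrative only; the real verification lives in the server, and hash_key here is a stand-in, not AGLedger's actual hashing code.

```shell
# HMAC a key with a given secret; extract the hex digest from openssl output.
hash_key() {
  printf '%s' "$1" | openssl dgst -sha256 -hmac "$2" | awk '{print $NF}'
}

API_KEY_SECRET="newsecret"
API_KEY_SECRET_PREVIOUS="oldsecret"

# Check a presented key against a stored hash, trying both secrets.
verify_key() {
  local key="$1" stored="$2"
  if [ "$(hash_key "$key" "$API_KEY_SECRET")" = "$stored" ]; then
    echo "ok (current secret)"
  elif [ -n "$API_KEY_SECRET_PREVIOUS" ] && \
       [ "$(hash_key "$key" "$API_KEY_SECRET_PREVIOUS")" = "$stored" ]; then
    echo "ok (previous secret)"
  else
    echo "rejected"
  fi
}

old_hash=$(hash_key "agl_example_key" "oldsecret")
verify_key "agl_example_key" "$old_hash"   # ok (previous secret)
```

Once step 4 re-hashes every stored key with the new secret, the fallback branch is never taken, and step 6 removes it entirely.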
4.2 VAULT_SIGNING_KEY rotation
Vault signing key rotation is simpler because old signatures remain valid — they were signed at creation time and are not re-verified against the current key.
- Generate a new Ed25519 signing key (via generate-signing-key.js inside the container)
- Update VAULT_SIGNING_KEY in .env
VAULT_SIGNING_KEYin.env - Restart services
New vault entries will be signed with the new key. Old entries retain their original signatures. Keep a record of the old key if you need to verify historical signatures.
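If you want to inspect Ed25519 key material outside the container, openssl can generate a keypair in PEM form. Note this is only a sketch: generate-signing-key.js is the supported path, and the key encoding AGLedger expects may differ from PEM.

```shell
# Generate an Ed25519 private key and extract its public half (PEM format).
keydir=$(mktemp -d)
openssl genpkey -algorithm ED25519 -out "$keydir/vault-signing-key.pem"
openssl pkey -in "$keydir/vault-signing-key.pem" -pubout -out "$keydir/vault-signing-key.pub"
head -1 "$keydir/vault-signing-key.pem"
```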
For detailed key lifecycle management, see the Vault Signing Key Guide.
5. Observability
5.1 Monitoring stack (Docker Compose)
The Docker Compose configuration includes an optional monitoring profile with OTel Collector, Jaeger, Prometheus, and Grafana:
docker compose --profile monitoring \
-f docker-compose.yml \
-f docker-compose.postgres.yml \
up -d --wait
This starts four additional services:
| Service | Port | Purpose |
|---------|------|---------|
| OTel Collector | 4317 (gRPC) | Receives traces from AGLedger, exports to Jaeger |
| Jaeger | 16686 | Distributed trace viewer (Jaeger v2) |
| Prometheus | 9090 | Metrics scraping and storage |
| Grafana | 3003 | Dashboard UI (default password: admin) |
5.2 Health endpoints
| Endpoint | Port | Auth | Use for |
|----------|------|------|---------|
| GET /health | 3001 (API), 3002 (Worker) | None | Liveness probes. Returns {"status":"ok"}. |
| GET /health/ready | 3001 (API), 3002 (Worker) | None | Readiness probes. Returns 503 if database is unreachable. |
| GET /status | 3001 | None | Public status page with database health check. |
| GET /v1/admin/system-health | 3001 | Platform key | Detailed system health: DB latency, pool stats, memory, pg-boss queue counts. |
For Kubernetes deployments:
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
readinessProbe:
httpGet:
path: /health/ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
5.3 Prometheus metrics
Both the API (port 3000) and Worker (port 3001) expose GET /metrics in Prometheus exposition format. No authentication required — intended for internal network scraping only.
Business metrics:
| Metric | Type | Description |
|--------|------|-------------|
| agledger_mandate_transitions_total | Counter | State transitions, labeled from_status and to_status |
| agledger_verification_duration_seconds | Histogram | Phase 2 verification latency by contract_type and result |
| agledger_worker_jobs_processed_total | Counter | Worker job completions by queue and status |
Infrastructure metrics:
| Metric | Type | Description |
|--------|------|-------------|
| agledger_http_request_duration_seconds | Histogram | HTTP latency by method, route, status_code |
| agledger_db_pool_total_connections | Gauge | Total PostgreSQL pool connections |
| agledger_db_pool_idle_connections | Gauge | Idle connections available |
| agledger_db_pool_waiting_connections | Gauge | Clients waiting for a connection (non-zero = pool exhaustion) |
Cache metrics:
| Metric | Type | Description |
|--------|------|-------------|
| agledger_auth_cache_hits_total | Counter | API key auth cache hits |
| agledger_auth_cache_misses_total | Counter | Auth cache misses (DB lookup required) |
| agledger_schema_cache_hits_total | Counter | Contract type schema cache hits |
| agledger_schema_cache_misses_total | Counter | Schema cache misses |
Process metrics (automatic, prefixed agledger_): CPU usage, resident memory, event loop lag, GC duration, active handles/requests.
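If you run your own Prometheus rather than the bundled monitoring profile, a minimal scrape configuration for the two /metrics endpoints might look like this (job names and target hosts are illustrative; adjust to your deployment):

```yaml
scrape_configs:
  - job_name: agledger-api
    metrics_path: /metrics
    static_configs:
      - targets: ["agledger-api:3000"]
  - job_name: agledger-worker
    metrics_path: /metrics
    static_configs:
      - targets: ["agledger-worker:3001"]
```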
5.4 Useful PromQL queries
# P95 request latency
histogram_quantile(0.95, rate(agledger_http_request_duration_seconds_bucket[5m]))
# Error rate (5xx)
sum(rate(agledger_http_request_duration_seconds_count{status_code=~"5.."}[5m]))
/
sum(rate(agledger_http_request_duration_seconds_count[5m]))
# Mandate transitions by destination state
sum by (to_status) (rate(agledger_mandate_transitions_total[5m]))
# DB pool saturation (alert if > 0)
agledger_db_pool_waiting_connections > 0
# Auth cache hit rate (target > 90%)
rate(agledger_auth_cache_hits_total[5m])
/
(rate(agledger_auth_cache_hits_total[5m]) + rate(agledger_auth_cache_misses_total[5m]))
# Worker job failure rate by queue
sum by (queue) (rate(agledger_worker_jobs_processed_total{status="failure"}[5m]))
/
sum by (queue) (rate(agledger_worker_jobs_processed_total[5m]))
5.5 Recommended Alertmanager rules
groups:
- name: agledger
rules:
- alert: HighErrorRate
expr: |
sum(rate(agledger_http_request_duration_seconds_count{status_code=~"5.."}[5m]))
/
sum(rate(agledger_http_request_duration_seconds_count[5m]))
> 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "AGLedger API error rate above 5%"
- alert: VerificationSlow
expr: |
histogram_quantile(0.95, rate(agledger_verification_duration_seconds_bucket[5m])) > 5
for: 10m
labels:
severity: warning
annotations:
summary: "Phase 2 verification P95 latency above 5 seconds"
- alert: DBPoolExhaustion
expr: agledger_db_pool_waiting_connections > 0
for: 2m
labels:
severity: critical
annotations:
summary: "Database connection pool has waiting clients"
- alert: WorkerJobFailures
expr: |
rate(agledger_worker_jobs_processed_total{status="failure"}[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Worker job failure rate elevated on queue {{ $labels.queue }}"
- alert: AuthCacheLowHitRate
expr: |
rate(agledger_auth_cache_hits_total[5m])
/
(rate(agledger_auth_cache_hits_total[5m]) + rate(agledger_auth_cache_misses_total[5m]))
< 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "Auth cache hit rate below 80%"
5.6 Distributed tracing (OpenTelemetry)
Tracing is disabled by default. Set OTEL_EXPORTER_OTLP_ENDPOINT to activate it.
| Variable | Default | Description |
|----------|---------|-------------|
| OTEL_EXPORTER_OTLP_ENDPOINT | (unset) | OTLP gRPC endpoint (e.g., http://jaeger:4317). Setting this activates tracing. |
| OTEL_PROVIDER | generic | generic (W3C Trace-Context) or xray (AWS X-Ray ID format). |
| OTEL_SERVICE_NAME | agledger-api | Service name in trace metadata. Worker uses agledger-worker. |
Auto-instrumented: HTTP requests (Fastify), PostgreSQL queries, outbound HTTP (webhook delivery). Custom span attributes include mandateId, agentId, enterpriseId, contractType.
5.7 Logging
AGLedger uses pino for structured JSON logging. Every log line is a JSON object with level, time, msg, and contextual fields (mandateId, reqId, etc.). Sensitive fields (API keys, secrets, passwords, tokens) are automatically redacted.
| Variable | Default | Description |
|----------|---------|-------------|
| LOG_LEVEL | info | Minimum level: trace, debug, info, warn, error, fatal |
5.8 SIEM integration
AGLedger can export audit events to your SIEM. Configure via environment variables:
| Variable | Default | Description |
|----------|---------|-------------|
| SIEM_ENABLED | false | Enable SIEM event export |
| SIEM_FORMAT | ocsf | Export format: ocsf or raw |
| SIEM_FILE_ENABLED | true | Write events to a local file (when SIEM is enabled) |
| SIEM_FILE_PATH | /var/log/agledger/siem.ndjson | File path for NDJSON output |
| SIEM_HTTP_ENABLED | false | Push events to an HTTP endpoint |
| SIEM_HTTP_URL | (empty) | HTTP endpoint URL |
| SIEM_HTTP_AUTH_HEADER | (empty) | Authorization header value for HTTP push |
| SIEM_BATCH_SIZE | 50 | Events per batch |
| SIEM_FLUSH_INTERVAL_MS | 5000 | Flush interval in milliseconds |
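Putting the variables together, a .env fragment for OCSF export over HTTP push might look like this (the URL and token are placeholders):

```shell
# Enable SIEM export in OCSF format, pushed to an HTTP collector.
SIEM_ENABLED=true
SIEM_FORMAT=ocsf
SIEM_HTTP_ENABLED=true
SIEM_HTTP_URL=https://siem.example.com/ingest
SIEM_HTTP_AUTH_HEADER="Bearer <token>"
# Larger batches and a shorter flush interval for higher event volume
SIEM_BATCH_SIZE=100
SIEM_FLUSH_INTERVAL_MS=2000
```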
6. Support Bundle
When contacting AGLedger support, generate a diagnostic bundle that collects system state with all secrets automatically redacted.
6.1 CLI support bundle
./scripts/support-bundle.sh
The script collects:
- Redacted .env — passwords, secrets, keys, tokens, and database URLs are replaced with [REDACTED]
- Docker Compose state — docker compose ps output
- Container logs — last 1000 lines from all services
- PostgreSQL diagnostics — schema dump (no data), table row counts, migration state
- System info — Docker version, OS info, memory, disk usage
- Health endpoints — responses from /health, /conformance, /status
- AGLedger version — version, database mode, image name
The bundle is saved as ./support-bundle-YYYY-MM-DD-HHMMSS.tar.gz. Send it to support@agledger.ai.
6.2 API support bundle
The admin API provides a JSON support bundle with 9 sections:
curl -s http://localhost:3001/v1/admin/support-bundle \
-H "Authorization: Bearer $PLATFORM_KEY" | jq .
Response sections:
| Section | Contents |
|---------|----------|
| manifest | Bundle version, generation timestamp, section index |
| version | AGLedger version, Node.js version, operating mode (standalone/gateway/hub) |
| license | License tier, status, features |
| health | Database connectivity, component status |
| authCache | Cache size, max capacity, hit rate |
| config | Runtime configuration (secrets excluded) |
| database | PostgreSQL version, migration state, table sizes |
| environment | Platform, CPU count, memory |
| guidance | List of items not included in the bundle and why |
The API bundle requires a platform key. Enterprise and agent keys receive 403.
7. Troubleshooting Runbook
7.1 Connection pooler incompatibility
Symptom: Jobs silently stop processing. Webhook deliveries stall. Worker logs show no errors but no activity.
Cause: AGLedger uses pg-boss, which requires PostgreSQL LISTEN/NOTIFY and session-level advisory locks. Transaction-mode connection poolers break both features.
Incompatible poolers:
- AWS RDS Proxy — causes connection pinning, breaks advisory locks
- PgBouncer in transaction mode — LISTEN/NOTIFY silently fails
- Cloud SQL managed pooling — uses transaction mode by default
Fix: Connect AGLedger directly to the PostgreSQL instance, bypassing any connection pooler. If you must use a pooler for other applications, configure AGLedger's DATABASE_URL to point to the direct endpoint.
7.2 SSL/TLS setup issues
Symptom: ECONNREFUSED or SSL handshake errors when connecting to a managed database.
For AWS RDS/Aurora:
The Docker image bundles the RDS global CA bundle at /etc/ssl/certs/rds-global-bundle.pem. Set:
NODE_EXTRA_CA_CERTS=/etc/ssl/certs/rds-global-bundle.pem
And append ?sslmode=verify-full to your DATABASE_URL.
For other providers: Mount your CA certificate into the container and set NODE_EXTRA_CA_CERTS to its path.
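For RDS, the resulting .env entries might look like this (the hostname and password are placeholders):

```shell
# Trust the bundled RDS CA and require full certificate verification.
NODE_EXTRA_CA_CERTS=/etc/ssl/certs/rds-global-bundle.pem
DATABASE_URL="postgresql://agledger:password@mydb.cluster-xyz.us-east-1.rds.amazonaws.com:5432/agledger?sslmode=verify-full"
```

With sslmode=verify-full, the client checks both the certificate chain and that the certificate matches the hostname, which catches man-in-the-middle endpoints that verify-ca alone would miss.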
7.3 Migration failures
Symptom: upgrade.sh fails at the migration step. The API container exits immediately.
Common causes:
- Insufficient privileges: The migration user needs CREATE TABLE, CREATE FUNCTION, CREATE INDEX, CREATE TRIGGER, CREATE SCHEMA (for pg-boss). On Aurora/RDS, this means rds_superuser role membership. On Cloud SQL, the cloudsqlsuperuser role.
- Separate migration URL: If your runtime DATABASE_URL uses a restricted user, set DATABASE_URL_MIGRATE to a connection string with owner-role privileges.
Diagnosis:
# Check migration state
docker compose exec postgres psql -U agledger -d agledger \
-c "SELECT * FROM _migrations ORDER BY id DESC LIMIT 5;"
# Run migrations manually with verbose output
docker compose run --rm agledger-migrate
7.4 Vault integrity scan
Run a vault scan after any restore, infrastructure incident, or suspected data integrity issue:
curl -X POST http://localhost:3001/v1/admin/vault/scan \
-H "Authorization: Bearer $PLATFORM_KEY" \
-H "Content-Type: application/json" \
-d '{}'
The scan may return a jobId for asynchronous processing. Poll for results:
curl -s http://localhost:3001/v1/admin/vault/scan/$JOB_ID \
-H "Authorization: Bearer $PLATFORM_KEY" | jq .
A healthy scan reports zero broken chains. If broken chains are found, the response indicates which mandate IDs are affected. Contact support with the scan results and a support bundle.
7.5 Preflight check failures
The preflight script verifies database connectivity, migrations, and configuration:
./scripts/preflight.sh
If the API container is running, the script executes preflight inside it. Otherwise, it starts a one-off container on the compose network.
Common failures:
- Database not reachable: Check that PostgreSQL is running and DATABASE_URL is correct
- Pending migrations: Run docker compose run --rm agledger-migrate
- Missing environment variables: Compare your .env with the .env.example in the distribution
8. High Availability
8.1 Multi-replica deployment
All API instances are stateless. They share a single PostgreSQL primary.
Docker Compose:
docker compose up -d --scale agledger-api=3 --scale agledger-worker=2
Kubernetes / Helm:
api:
replicaCount: 3
worker:
replicaCount: 2
8.2 Load balancer configuration
| Setting | Value | Why |
|---------|-------|-----|
| Health check path | /health/ready | Returns 503 during startup, shutdown, and DB outage |
| Health check interval | 5-10s | Fast enough to detect failover |
| Deregistration delay | 30s | Matches graceful shutdown drain period |
| Sticky sessions | Not required | All instances are stateless |
8.3 What is safe under concurrency
| Component | Mechanism |
|-----------|-----------|
| Vault hash chain | pg_advisory_xact_lock per mandate — concurrent writers serialize |
| Webhook sequence counters | UPDATE ... RETURNING — PostgreSQL row-level lock guarantees unique, monotonic values |
| Maintenance jobs | pg-boss singletonKey scheduling — only one instance picks up each job |
| Vault checkpoints | UNIQUE(mandate_id, chain_position) + ON CONFLICT DO NOTHING — dedup on retry |
| Provisioning reload | pg_try_advisory_lock — only one instance reconciles at a time |
| Auth/signing key caches | LISTEN/NOTIFY invalidation — changes propagate to all instances within seconds |
8.4 Connection failover
When PostgreSQL fails over (RDS Multi-AZ, Aurora failover), all connections drop simultaneously. AGLedger handles this automatically:
| Time | What happens |
|------|-------------|
| T+0s | Primary fails, connections drop |
| T+0-5s | Requests return 500, /health/ready returns 503 |
| T+5-10s | Load balancer deregisters unhealthy instances |
| T+10-30s | DNS/endpoint updates to new primary (varies by provider) |
| T+30-60s | Pool reconnects, health check passes, traffic resumes |
No manual intervention required. All state is in PostgreSQL. Transaction-scoped advisory locks are released automatically on connection drop.
8.5 Graceful shutdown
During rolling deploys, each instance:
- Receives SIGTERM
- /health/ready returns 503 immediately (stops new traffic)
- In-flight HTTP requests drain (up to HANDLER_TIMEOUT_MS, default 30s)
- pg-boss stops gracefully (up to 20s for active jobs)
- Connection pool closes
- Process exits
Set your orchestrator's termination grace period to at least 35 seconds. The Kubernetes default of 30 seconds may need increasing.
8.6 Connection pool sizing
Each instance uses up to DATABASE_POOL_MAX connections (default 20). Total usage per instance: API pool + Worker pool (~10) + pg-boss overhead (~5).
Total connections = (API replicas + Worker replicas) * DATABASE_POOL_MAX
Ensure PostgreSQL's max_connections exceeds this total. For multi-replica deployments, reduce DATABASE_POOL_MAX:
DATABASE_POOL_MAX=10 # 5 instances * 10 = 50 connections
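The sizing formula is simple enough to check in a one-off script. The replica counts, overhead figure, and max_connections value below are examples, not recommendations:

```shell
# Estimate total PostgreSQL connections for a multi-replica deployment.
api_replicas=3
worker_replicas=2
pool_max=10        # DATABASE_POOL_MAX per instance
pg_boss_overhead=5 # approximate extra connections for pg-boss

total=$(( (api_replicas + worker_replicas) * pool_max + pg_boss_overhead ))
echo "expected connections: $total"

max_connections=90   # e.g. a small Aurora Serverless instance
if [ "$total" -lt "$max_connections" ]; then
  echo "fits within max_connections"
else
  echo "reduce DATABASE_POOL_MAX or replica count"
fi
```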
Reference sizing:
- Aurora Serverless 0.5 ACU: ~90 max_connections, set pool to 10
- Aurora Serverless 2+ ACU: 15-20 is fine
- RDS db.t3.medium: 20 is fine
- Cloud SQL shared-core: as low as 25 max_connections, set pool to 5-8
8.7 Rate limiting in multi-replica deployments
By default, rate limit counters are in-memory (per-process). With multiple replicas, each enforces limits independently, effectively multiplying the limit by the replica count.
For accurate cross-replica rate limiting, switch to the PostgreSQL-backed store:
RATE_LIMIT_STORE=postgresql
The PostgreSQL store uses an UNLOGGED table for performance. If the database crashes, counters reset — this only means a brief window of unenforced limits, not data loss.
Validated: 58 assertions covering HA, support bundle, and day-2 operations. View test source.