Self-Hosted Operations

Day-2 operations for self-hosted AGLedger: upgrades, backup/restore, rollback, secret rotation, observability, troubleshooting, and high availability.

This guide assumes AGLedger is installed and running. All commands use the scripts included in the self-hosted distribution. Paths are relative to the self-hosted repository root.

1. Upgrading

The upgrade.sh script automates the full upgrade workflow: backup, image pull, migration, restart, and verification.

Basic upgrade

./upgrade.sh 1.3.0

The script performs these steps in order:

  1. Detects the current running version (from .env, VERSION file, or Docker image tag)
  2. Creates a pre-upgrade backup (calls ./scripts/backup.sh)
  3. Authenticates with the container registry (ECR)
  4. Pulls the target version image
  5. Stops the worker to prevent job processing during migration
  6. Runs database migrations using the new image
  7. Updates AGLEDGER_VERSION in .env and VERSION file
  8. Restarts all services
  9. Waits for the API to become healthy (up to 30 seconds)
  10. Runs preflight checks
  11. Verifies the new version via /health and /conformance
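The detection in step 1 follows a simple precedence. A minimal sketch of that precedence (a hypothetical helper, not the actual upgrade.sh internals):

```shell
#!/bin/sh
# Sketch of upgrade.sh's version-detection precedence:
# .env wins, then the VERSION file, then the Docker image tag.
set -eu

detect_version() {
  dir="$1"
  # 1. Prefer AGLEDGER_VERSION from .env
  if [ -f "$dir/.env" ] && grep -q '^AGLEDGER_VERSION=' "$dir/.env"; then
    grep '^AGLEDGER_VERSION=' "$dir/.env" | cut -d= -f2
    return
  fi
  # 2. Fall back to the VERSION file
  if [ -f "$dir/VERSION" ]; then
    cat "$dir/VERSION"
    return
  fi
  # 3. The real script would inspect the running Docker image tag here
  echo "unknown"
}

tmp="$(mktemp -d)"
printf 'AGLEDGER_VERSION=1.2.9\n' > "$tmp/.env"
printf '1.2.8\n' > "$tmp/VERSION"
detect_version "$tmp"   # .env takes precedence over VERSION
rm -rf "$tmp"
```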

Skipping the pre-upgrade backup

./upgrade.sh 1.3.0 --skip-backup

Not recommended for production. Use only if you have a separate backup mechanism (e.g., RDS automated snapshots).

Deprecated environment variables

If a release removes or renames environment variables, the upgrade script detects the deprecated entries in .env and handles them automatically; no manual cleanup is required.

Verifying after upgrade

curl -s http://localhost:3001/health | jq .
curl -s http://localhost:3001/conformance | jq .version
docker compose -f docker-compose/docker-compose.yml ps

2. Backup and Restore

2.1 Creating a backup

./scripts/backup.sh

This creates a timestamped tarball containing a PostgreSQL custom-format dump (pg_dump -Fc). The default location is ./backup/backup-YYYY-MM-DD-HHMMSS.tar.gz.

Retention: By default, the script keeps the 7 most recent backups and deletes older ones.

# Keep last 14 backups
./scripts/backup.sh --keep 14

# Custom backup directory
BACKUP_DIR=/mnt/backups ./scripts/backup.sh

Bundled vs. external database: The script auto-detects your database mode from DATABASE_URL in .env. For bundled PostgreSQL, it uses docker compose exec postgres pg_dump. For external databases (Aurora, RDS, Cloud SQL), it calls pg_dump directly — ensure the PostgreSQL client tools are installed on the host.
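The auto-detection amounts to a check on the DATABASE_URL host. A sketch under the assumption that the bundled compose service is named postgres (illustrative only; the real script's logic may differ):

```shell
#!/bin/sh
# Sketch: classify DATABASE_URL as bundled vs. external PostgreSQL.
# Assumes the bundled compose service is reachable as "postgres".
set -eu

db_mode() {
  url="$1"
  # Extract the host portion: scheme://user:pass@HOST:port/db
  host="$(printf '%s' "$url" | sed -E 's#^[^@]*@([^:/]+).*#\1#')"
  case "$host" in
    postgres|localhost|127.0.0.1) echo "bundled" ;;
    *) echo "external" ;;
  esac
}

db_mode "postgresql://agledger:secret@postgres:5432/agledger"                          # -> bundled
db_mode "postgresql://agledger:secret@mydb.abc123.us-east-1.rds.amazonaws.com:5432/agledger"  # -> external
```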

2.2 Restoring from backup

./scripts/restore.sh backup/backup-2026-03-14-120000.tar.gz

The restore script:

  1. Prompts for confirmation (the restore replaces all data in the database)
  2. Extracts the backup tarball to a temporary directory
  3. Stops the API and worker services
  4. Terminates active database connections
  5. Drops and recreates the database
  6. Runs pg_restore --no-owner --no-acl from the dump
  7. Restarts all services with docker compose up -d --wait
  8. Runs preflight checks to verify the restore
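Step 4 (terminating active connections) typically amounts to a statement like the following, assuming the database is named agledger:

```sql
-- Terminate every session connected to the target database,
-- except the session issuing this statement.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'agledger'
  AND pid <> pg_backend_pid();
```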

For non-interactive use (CI, automation):

./scripts/restore.sh --non-interactive backup/backup-2026-03-14-120000.tar.gz

External database restore: For external databases, the restore script uses psql and pg_restore directly. The database user must have CREATEDB privilege (or the ability to drop and recreate the target database). If DATABASE_URL_MIGRATE is set, the script uses that connection (which typically has owner-role privileges for DDL).

2.3 Managed database backups

If you run AGLedger on a managed PostgreSQL service, use the provider's native backup tools alongside (or instead of) the script-based backups:

| Provider | Backup mechanism | Notes |
|----------|------------------|-------|
| AWS RDS / Aurora | Automated snapshots + point-in-time recovery | Enable automated backups; set retention to at least 7 days |
| Google Cloud SQL | Automated backups + on-demand backups | Enable in the instance configuration |
| Azure Database for PostgreSQL | Automated backups (enabled by default) | Retention configurable 7-35 days |

Managed backups provide continuous protection with near-zero RPO. The script-based backups (pg_dump) are useful for cross-version migration and portable archives.

2.4 Post-restore verification

After restoring from any backup method, verify the instance:

# 1. Health check
curl -s http://localhost:3001/health | jq .

# 2. Vault chain integrity scan
curl -X POST http://localhost:3001/v1/admin/vault/scan \
  -H "Authorization: Bearer $PLATFORM_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

# 3. If using YAML provisioning, reload to reconcile state
curl -X POST http://localhost:3001/v1/admin/provisioning/reload \
  -H "Authorization: Bearer $PLATFORM_KEY" \
  -H "Content-Type: application/json"

# 4. Run a smoke-test lifecycle (create mandate, submit receipt, verify settlement)

The vault scan walks every mandate's hash chain and reports any broken chains. After a clean restore, all chains should be intact up to the restore point — PostgreSQL's transactional consistency guarantees that partial vault entries cannot exist.
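The chain structure the scan verifies can be pictured with a toy example: each entry's hash covers the previous entry's hash plus the entry payload, so tampering with any entry breaks every hash after it. This mimics the shape only; AGLedger's actual entry format differs.

```shell
#!/bin/sh
# Toy hash chain: hash(n) = sha256(hash(n-1) || payload(n)).
set -eu

chain_hash() {  # args: prev_hash payload -> new hash
  printf '%s%s' "$1" "$2" | sha256sum | cut -d' ' -f1
}

prev="genesis"
for payload in "mandate.created" "receipt.submitted" "mandate.settled"; do
  prev="$(chain_hash "$prev" "$payload")"
done
echo "head: $prev"
```

Changing any payload changes the head hash, which is how a scan detects a broken chain.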

3. Rollback

The rollback.sh script restores the database from backup and reverts to a previous image version. Database migrations are forward-only, so rollback means restoring the full database state.

Auto-detected rollback

./scripts/rollback.sh

Without arguments, the script restores from the most recent backup in the backup directory and rolls back to the version recorded in .pre-upgrade-version, prompting for confirmation before proceeding.

Explicit rollback

# Specify the target version
./scripts/rollback.sh --version 0.9.2

# Specify a specific backup file
./scripts/rollback.sh --backup backup/backup-2026-03-25-120000.tar.gz

# Non-interactive (for automation)
./scripts/rollback.sh --non-interactive --version 0.9.2

What rollback does

  1. Locates the backup (most recent, or specified with --backup)
  2. Resolves the target version (from .pre-upgrade-version, --version, or interactive prompt)
  3. Restores the database by calling ./scripts/restore.sh --non-interactive
  4. Updates AGLEDGER_VERSION in .env and VERSION file
  5. Pulls the target version image (falls back to cached image if pull fails)
  6. Restarts all services

Data loss warning: All data written after the backup will be lost. This includes mandates, receipts, vault entries, and webhook deliveries created between the backup and the rollback.

4. Secret Rotation

Two secrets may need rotation: API_KEY_SECRET (used for HMAC hashing of API keys) and VAULT_SIGNING_KEY (used for Ed25519 signatures on audit vault entries).

Run the interactive rotation guide:

./scripts/rotate-secrets.sh                  # Prompts for which secret
./scripts/rotate-secrets.sh api-key-secret   # Rotate API key secret directly
./scripts/rotate-secrets.sh vault-signing-key # Rotate vault signing key directly

4.1 API_KEY_SECRET rotation

This is a 6-step process with a dual-secret window to avoid downtime:

  1. Save the current API_KEY_SECRET as API_KEY_SECRET_PREVIOUS in .env
  2. Generate a new API_KEY_SECRET (64-character hex via openssl rand -hex 32)
  3. Restart services — dual-secret mode activates (both old and new keys work)
  4. Re-hash all API keys with the new secret (rehash-api-keys.js). This step is irreversible.
  5. Restart services to verify
  6. Remove API_KEY_SECRET_PREVIOUS from .env and restart (old hashes stop working)

Between steps 3 and 6, both the old and new secrets are active. This allows you to pause and verify that integrations still work before committing.
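The dual-secret window can be sketched as a two-step HMAC check: try the current secret first, then fall back to the previous one. Function names here are illustrative, not AGLedger's actual code.

```shell
#!/bin/sh
# Sketch of dual-secret API key verification (illustrative).
set -eu

hmac() {  # args: secret message -> hex digest
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" -hex | awk '{print $NF}'
}

verify_key() {  # args: presented_key stored_hash
  presented="$1"; stored_hash="$2"
  if [ "$(hmac "$API_KEY_SECRET" "$presented")" = "$stored_hash" ]; then
    echo "ok (current secret)"
  elif [ -n "${API_KEY_SECRET_PREVIOUS:-}" ] && \
       [ "$(hmac "$API_KEY_SECRET_PREVIOUS" "$presented")" = "$stored_hash" ]; then
    echo "ok (previous secret)"
  else
    echo "rejected"
  fi
}

API_KEY_SECRET="newsecret"
API_KEY_SECRET_PREVIOUS="oldsecret"
# A hash created under the old secret still verifies during the window.
old_hash="$(hmac "$API_KEY_SECRET_PREVIOUS" "agl_live_example")"
verify_key "agl_live_example" "$old_hash"
```

Once step 6 removes API_KEY_SECRET_PREVIOUS, only hashes under the new secret verify, which is why the re-hash in step 4 must complete first.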

4.2 VAULT_SIGNING_KEY rotation

Vault signing key rotation is simpler because old signatures remain valid — they were signed at creation time and are not re-verified against the current key.

  1. Generate a new Ed25519 signing key (via generate-signing-key.js inside the container)
  2. Update VAULT_SIGNING_KEY in .env
  3. Restart services

New vault entries will be signed with the new key. Old entries retain their original signatures. Keep a record of the old key if you need to verify historical signatures.

For detailed key lifecycle management, see the Vault Signing Key Guide.

5. Observability

5.1 Monitoring stack (Docker Compose)

The Docker Compose configuration includes an optional monitoring profile with OTel Collector, Jaeger, Prometheus, and Grafana:

docker compose --profile monitoring \
  -f docker-compose.yml \
  -f docker-compose.postgres.yml \
  up -d --wait

This starts four additional services:

| Service | Port | Purpose |
|---------|------|---------|
| OTel Collector | 4317 (gRPC) | Receives traces from AGLedger, exports to Jaeger |
| Jaeger | 16686 | Distributed trace viewer (Jaeger v2) |
| Prometheus | 9090 | Metrics scraping and storage |
| Grafana | 3003 | Dashboard UI (default password: admin) |

5.2 Health endpoints

| Endpoint | Port | Auth | Use for |
|----------|------|------|---------|
| GET /health | 3001 (API), 3002 (Worker) | None | Liveness probes. Returns {"status":"ok"}. |
| GET /health/ready | 3001 (API), 3002 (Worker) | None | Readiness probes. Returns 503 if database is unreachable. |
| GET /status | 3001 | None | Public status page with database health check. |
| GET /v1/admin/system-health | 3001 | Platform key | Detailed system health: DB latency, pool stats, memory, pg-boss queue counts. |

For Kubernetes deployments:

livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10

5.3 Prometheus metrics

Both the API (port 3000) and Worker (port 3001) expose GET /metrics in Prometheus exposition format. No authentication required — intended for internal network scraping only.

Business metrics:

| Metric | Type | Description |
|--------|------|-------------|
| agledger_mandate_transitions_total | Counter | State transitions, labeled from_status and to_status |
| agledger_verification_duration_seconds | Histogram | Phase 2 verification latency by contract_type and result |
| agledger_worker_jobs_processed_total | Counter | Worker job completions by queue and status |

Infrastructure metrics:

| Metric | Type | Description |
|--------|------|-------------|
| agledger_http_request_duration_seconds | Histogram | HTTP latency by method, route, status_code |
| agledger_db_pool_total_connections | Gauge | Total PostgreSQL pool connections |
| agledger_db_pool_idle_connections | Gauge | Idle connections available |
| agledger_db_pool_waiting_connections | Gauge | Clients waiting for a connection (non-zero = pool exhaustion) |

Cache metrics:

| Metric | Type | Description |
|--------|------|-------------|
| agledger_auth_cache_hits_total | Counter | API key auth cache hits |
| agledger_auth_cache_misses_total | Counter | Auth cache misses (DB lookup required) |
| agledger_schema_cache_hits_total | Counter | Contract type schema cache hits |
| agledger_schema_cache_misses_total | Counter | Schema cache misses |

Process metrics (automatic, prefixed agledger_): CPU usage, resident memory, event loop lag, GC duration, active handles/requests.
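If you scrape these endpoints with your own Prometheus rather than the bundled monitoring profile, a minimal scrape config might look like this (the target host names are assumptions based on typical compose service names; adjust to your network):

```yaml
scrape_configs:
  - job_name: agledger-api
    static_configs:
      - targets: ["agledger-api:3000"]
  - job_name: agledger-worker
    static_configs:
      - targets: ["agledger-worker:3001"]
```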

5.4 Useful PromQL queries

# P95 request latency
histogram_quantile(0.95, rate(agledger_http_request_duration_seconds_bucket[5m]))

# Error rate (5xx)
sum(rate(agledger_http_request_duration_seconds_count{status_code=~"5.."}[5m]))
/
sum(rate(agledger_http_request_duration_seconds_count[5m]))

# Mandate transitions by destination state
sum by (to_status) (rate(agledger_mandate_transitions_total[5m]))

# DB pool saturation (alert if > 0)
agledger_db_pool_waiting_connections > 0

# Auth cache hit rate (target > 90%)
rate(agledger_auth_cache_hits_total[5m])
/
(rate(agledger_auth_cache_hits_total[5m]) + rate(agledger_auth_cache_misses_total[5m]))

# Worker job failure rate by queue
sum by (queue) (rate(agledger_worker_jobs_processed_total{status="failure"}[5m]))
/
sum by (queue) (rate(agledger_worker_jobs_processed_total[5m]))

5.5 Recommended Alertmanager rules

groups:
  - name: agledger
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(agledger_http_request_duration_seconds_count{status_code=~"5.."}[5m]))
          /
          sum(rate(agledger_http_request_duration_seconds_count[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AGLedger API error rate above 5%"

      - alert: VerificationSlow
        expr: |
          histogram_quantile(0.95, rate(agledger_verification_duration_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Phase 2 verification P95 latency above 5 seconds"

      - alert: DBPoolExhaustion
        expr: agledger_db_pool_waiting_connections > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool has waiting clients"

      - alert: WorkerJobFailures
        expr: |
          rate(agledger_worker_jobs_processed_total{status="failure"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Worker job failure rate elevated on queue {{ $labels.queue }}"

      - alert: AuthCacheLowHitRate
        expr: |
          rate(agledger_auth_cache_hits_total[5m])
          /
          (rate(agledger_auth_cache_hits_total[5m]) + rate(agledger_auth_cache_misses_total[5m]))
          < 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Auth cache hit rate below 80%"

5.6 Distributed tracing (OpenTelemetry)

Tracing is disabled by default. Set OTEL_EXPORTER_OTLP_ENDPOINT to activate it.

| Variable | Default | Description |
|----------|---------|-------------|
| OTEL_EXPORTER_OTLP_ENDPOINT | (unset) | OTLP gRPC endpoint (e.g., http://jaeger:4317). Setting this activates tracing. |
| OTEL_PROVIDER | generic | generic (W3C Trace-Context) or xray (AWS X-Ray ID format). |
| OTEL_SERVICE_NAME | agledger-api | Service name in trace metadata. Worker uses agledger-worker. |

Auto-instrumented: HTTP requests (Fastify), PostgreSQL queries, outbound HTTP (webhook delivery). Custom span attributes include mandateId, agentId, enterpriseId, contractType.
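For example, a minimal .env fragment enabling tracing against the bundled Jaeger, using the defaults from the table above:

```ini
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_PROVIDER=generic
OTEL_SERVICE_NAME=agledger-api
```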

5.7 Logging

AGLedger uses pino for structured JSON logging. Every log line is a JSON object with level, time, msg, and contextual fields (mandateId, reqId, etc.). Sensitive fields (API keys, secrets, passwords, tokens) are automatically redacted.

| Variable | Default | Description |
|----------|---------|-------------|
| LOG_LEVEL | info | Minimum level: trace, debug, info, warn, error, fatal |

5.8 SIEM integration

AGLedger can export audit events to your SIEM. Configure via environment variables:

| Variable | Default | Description | |----------|---------|-------------| | SIEM_ENABLED | false | Enable SIEM event export | | SIEM_FORMAT | ocsf | Export format: ocsf or raw | | SIEM_FILE_ENABLED | true | Write events to a local file (when SIEM is enabled) | | SIEM_FILE_PATH | /var/log/agledger/siem.ndjson | File path for NDJSON output | | SIEM_HTTP_ENABLED | false | Push events to an HTTP endpoint | | SIEM_HTTP_URL | (empty) | HTTP endpoint URL | | SIEM_HTTP_AUTH_HEADER | (empty) | Authorization header value for HTTP push | | SIEM_BATCH_SIZE | 50 | Events per batch | | SIEM_FLUSH_INTERVAL_MS | 5000 | Flush interval in milliseconds |
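As an example, a hypothetical .env fragment enabling OCSF-formatted HTTP push to an external collector (the URL and token are placeholders, not real endpoints):

```ini
SIEM_ENABLED=true
SIEM_FORMAT=ocsf
SIEM_HTTP_ENABLED=true
SIEM_HTTP_URL=https://siem.example.com/ingest
SIEM_HTTP_AUTH_HEADER=Bearer <your-collector-token>
```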

6. Support Bundle

When contacting AGLedger support, generate a diagnostic bundle that collects system state with all secrets automatically redacted.

6.1 CLI support bundle

./scripts/support-bundle.sh

The script collects diagnostic system state into a single archive, with all secrets redacted.

The bundle is saved as ./support-bundle-YYYY-MM-DD-HHMMSS.tar.gz. Send it to support@agledger.ai.

6.2 API support bundle

The admin API provides a JSON support bundle with 9 sections:

curl -s http://localhost:3001/v1/admin/support-bundle \
  -H "Authorization: Bearer $PLATFORM_KEY" | jq .

Response sections:

| Section | Contents |
|---------|----------|
| manifest | Bundle version, generation timestamp, section index |
| version | AGLedger version, Node.js version, operating mode (standalone/gateway/hub) |
| license | License tier, status, features |
| health | Database connectivity, component status |
| authCache | Cache size, max capacity, hit rate |
| config | Runtime configuration (secrets excluded) |
| database | PostgreSQL version, migration state, table sizes |
| environment | Platform, CPU count, memory |
| guidance | List of items not included in the bundle and why |

The API bundle requires a platform key. Enterprise and agent keys receive 403.

7. Troubleshooting Runbook

7.1 Connection pooler incompatibility

Symptom: Jobs silently stop processing. Webhook deliveries stall. Worker logs show no errors but no activity.

Cause: AGLedger uses pg-boss, which requires PostgreSQL LISTEN/NOTIFY and session-level advisory locks. Transaction-mode connection poolers break both features.

Incompatible poolers include PgBouncer in transaction or statement mode and AWS RDS Proxy. Any pooler that multiplexes client connections across backend sessions at the transaction level breaks LISTEN/NOTIFY delivery and session-level advisory locks.

Fix: Connect AGLedger directly to the PostgreSQL instance, bypassing any connection pooler. If you must use a pooler for other applications, configure AGLedger's DATABASE_URL to point to the direct endpoint.

7.2 SSL/TLS setup issues

Symptom: ECONNREFUSED or SSL handshake errors when connecting to a managed database.

For AWS RDS/Aurora: The Docker image bundles the RDS global CA bundle at /etc/ssl/certs/rds-global-bundle.pem. Set:

NODE_EXTRA_CA_CERTS=/etc/ssl/certs/rds-global-bundle.pem

And append ?sslmode=verify-full to your DATABASE_URL.

For other providers: Mount your CA certificate into the container and set NODE_EXTRA_CA_CERTS to its path.
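A sketch of the compose override for a custom CA (the service name and file paths are assumptions; adjust to your layout):

```yaml
services:
  agledger-api:
    environment:
      # Trust the mounted CA in addition to the system bundle
      NODE_EXTRA_CA_CERTS: /etc/ssl/certs/custom-ca.pem
    volumes:
      - ./certs/custom-ca.pem:/etc/ssl/certs/custom-ca.pem:ro
```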

7.3 Migration failures

Symptom: upgrade.sh fails at the migration step. The API container exits immediately.

Common causes include insufficient DDL privileges on the migration connection (set DATABASE_URL_MIGRATE to an owner-role connection) and schema locks held by long-running queries.

Diagnosis:

# Check migration state
docker compose exec postgres psql -U agledger -d agledger \
  -c "SELECT * FROM _migrations ORDER BY id DESC LIMIT 5;"

# Run migrations manually with verbose output
docker compose run --rm agledger-migrate

7.4 Vault integrity scan

Run a vault scan after any restore, infrastructure incident, or suspected data integrity issue:

curl -X POST http://localhost:3001/v1/admin/vault/scan \
  -H "Authorization: Bearer $PLATFORM_KEY" \
  -H "Content-Type: application/json" \
  -d '{}'

The scan may return a jobId for asynchronous processing. Poll for results:

curl -s http://localhost:3001/v1/admin/vault/scan/$JOB_ID \
  -H "Authorization: Bearer $PLATFORM_KEY" | jq .

A healthy scan reports zero broken chains. If broken chains are found, the response indicates which mandate IDs are affected. Contact support with the scan results and a support bundle.

7.5 Preflight check failures

The preflight script verifies database connectivity, migrations, and configuration:

./scripts/preflight.sh

If the API container is running, the script executes preflight inside it. Otherwise, it starts a one-off container on the compose network.

Common failures correspond to what the script verifies: an unreachable database, unapplied migrations, or missing or invalid configuration values in .env.

8. High Availability

8.1 Multi-replica deployment

All API instances are stateless. They share a single PostgreSQL primary.

Docker Compose:

docker compose up -d --scale agledger-api=3 --scale agledger-worker=2

Kubernetes / Helm:

api:
  replicaCount: 3

worker:
  replicaCount: 2

8.2 Load balancer configuration

| Setting | Value | Why |
|---------|-------|-----|
| Health check path | /health/ready | Returns 503 during startup, shutdown, and DB outage |
| Health check interval | 5-10s | Fast enough to detect failover |
| Deregistration delay | 30s | Matches graceful shutdown drain period |
| Sticky sessions | Not required | All instances are stateless |

8.3 What is safe under concurrency

| Component | Mechanism |
|-----------|-----------|
| Vault hash chain | pg_advisory_xact_lock per mandate — concurrent writers serialize |
| Webhook sequence counters | UPDATE ... RETURNING — PostgreSQL row-level lock guarantees unique, monotonic values |
| Maintenance jobs | pg-boss singletonKey scheduling — only one instance picks up each job |
| Vault checkpoints | UNIQUE(mandate_id, chain_position) + ON CONFLICT DO NOTHING — dedup on retry |
| Provisioning reload | pg_try_advisory_lock — only one instance reconciles at a time |
| Auth/signing key caches | LISTEN/NOTIFY invalidation — changes propagate to all instances within seconds |
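The per-mandate serialization pattern, sketched in SQL (the lock-key derivation shown is illustrative; the real code may derive it differently):

```sql
-- Serialize concurrent vault writers for one mandate: the second
-- transaction blocks here until the first commits or rolls back.
BEGIN;
SELECT pg_advisory_xact_lock(hashtext('mandate:' || 'mnd_example'));
-- ... append the vault entry; the lock releases automatically at COMMIT
COMMIT;
```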

8.4 Connection failover

When PostgreSQL fails over (RDS Multi-AZ, Aurora failover), all connections drop simultaneously. AGLedger handles this automatically:

| Time | What happens |
|------|--------------|
| T+0s | Primary fails, connections drop |
| T+0-5s | Requests return 500, /health/ready returns 503 |
| T+5-10s | Load balancer deregisters unhealthy instances |
| T+10-30s | DNS/endpoint updates to new primary (varies by provider) |
| T+30-60s | Pool reconnects, health check passes, traffic resumes |

No manual intervention required. All state is in PostgreSQL. Transaction-scoped advisory locks are released automatically on connection drop.

8.5 Graceful shutdown

During rolling deploys, each instance:

  1. Receives SIGTERM
  2. /health/ready returns 503 immediately (stops new traffic)
  3. In-flight HTTP requests drain (up to HANDLER_TIMEOUT_MS, default 30s)
  4. pg-boss stops gracefully (up to 20s for active jobs)
  5. Connection pool closes
  6. Process exits

Set your orchestrator's termination grace period to at least 35 seconds. The Kubernetes default of 30 seconds may need increasing.
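In Kubernetes, that means setting the grace period on the pod spec, for example:

```yaml
spec:
  template:
    spec:
      # Covers the 30s request drain plus pg-boss shutdown with headroom
      terminationGracePeriodSeconds: 45
```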

8.6 Connection pool sizing

Each instance uses up to DATABASE_POOL_MAX connections (default 20). Total usage per instance: API pool + Worker pool (~10) + pg-boss overhead (~5).

Total connections = (API replicas + Worker replicas) * DATABASE_POOL_MAX

Ensure PostgreSQL's max_connections exceeds this total. For multi-replica deployments, reduce DATABASE_POOL_MAX:

DATABASE_POOL_MAX=10   # 5 instances * 10 = 50 connections

Reference sizing: 3 API replicas and 2 worker replicas at DATABASE_POOL_MAX=10 use roughly 50 application connections; a PostgreSQL max_connections of 100 leaves headroom for migrations, backups, and ad-hoc sessions.

8.7 Rate limiting in multi-replica deployments

By default, rate limit counters are in-memory (per-process). With multiple replicas, each enforces limits independently, effectively multiplying the limit by the replica count.

For accurate cross-replica rate limiting, switch to the PostgreSQL-backed store:

RATE_LIMIT_STORE=postgresql

The PostgreSQL store uses an UNLOGGED table for performance. If the database crashes, counters reset — this only means a brief window of unenforced limits, not data loss.


Validated: 58 assertions covering HA, support bundle, and day-2 operations.