2026-04-13 · updated 2026-04-18

Engineering

Zero Mocks, Real Infrastructure: How AGLedger Tests an Accountability Engine

By Michael Cooper · Founder

Note: Numbers in this post were measured on AGLedger v0.19.10. The current API baseline is v0.23.0 (2026-05-19). Methodology and direction hold; specific magnitudes may differ on the current version.

An accountability system that hides bugs is worse than no accountability system. If the audit trail says PASS when the behavior was wrong, you have a liability, not a feature. That principle, aligned with NIST's minimum standards for developer verification, drives everything about how we test AGLedger.

Summary

AGLedger’s independent testbed runs 36 tests against a live API: real EKS cluster, real Aurora database, real webhook delivery, real LLM agents. No mocks. No stubs. Every test produces a structured JSON result with three possible outcomes: PASS, FAIL, or SKIP. Failures are documented with finding numbers (F-NNN, starting at F-001) and tracked to resolution.

Last week the suite had 152 tests. We deleted 117 of them. The ones we kept focus on behavior you can only learn from a real deployment. The rest were duplicating the AGLedger API repo’s own integration suite and adding noise without signal.

Why we deleted 117 tests

A testbed only earns its keep when it catches things other tests can’t. If the API repo already has a state-machine unit test for record transitions, re-running that logic against a live deployment doesn’t add information; it adds latency. It also dilutes the signal when a real-infrastructure bug slips in, because the failing test is buried in a long list of tests that have nothing to do with deployment.

On 2026-04-16 we audited every test against a single question: does this only manifest on deployed infrastructure, or against a customer-facing surface? Tests that failed that question were deleted. 117 of them did. What remains: 36 tests that exercise the ALB, the WAF, Aurora, TLS, the SDKs, the CLI, the MCP server, webhook delivery, federation, HA failover, and day-2 operations.

Fewer tests, higher signal. Test runtime dropped from 54.5 minutes to roughly 5. Every failure now points to something the testbed is uniquely positioned to catch.

Philosophy

The goal is product improvement, not passing tests.

A test that marks a broken thing as PASS is worse than no test. A test that skips instead of failing hides the problem. We have three rules for test authors:

1. Never mark a broken thing as PASS

2. Never skip-as-PASS. Use t.skip() with a reason

3. Assert the specific thing you're testing, not just HTTP 200

Every failure gets a finding number (F-001 onward), a severity, expected vs. actual behavior, reproduction steps, and an impact assessment. Findings are tracked in a catalog any team member can read. Resolved findings move to an archive. The open ones stay visible.

Infrastructure: no mocks

The testbed runs against a live deployment: not a mock server, not a test double, not an in-memory stub. The stack:

Compute: EKS cluster (dedicated testbed namespace)

Database: Aurora Serverless v2, PostgreSQL 17

Networking: ALB Ingress, WAF with IP allowlist

Deployment: Helm chart, same Docker image as production

Workers: real async verification, settlement, and webhook delivery

Fresh namespace per test run. Ephemeral database. Auto-provisioned keys. Deterministic cleanup. The testbed tests what customers actually run.

What we test

36 tests organized by deployment surface and customer interface. Each test is a standalone executable TypeScript file: no shared state, no fixtures, no implicit ordering.

Profile	What it validates
infra	ALB health, WAF rules, TLS cert handling, Aurora failover behavior, HA support bundle
onboarding	First-run install, license provisioning, fresh registration, day-one customer path
integration	Orchestrator chains, mixed-chain (ERP + AI) workflows, entity references, scope profiles
SDK	TypeScript SDK (`@agledger/sdk`), Python SDK (`agledger`), CLI, MCP server, zero-scaffolding discovery
webhooks	HMAC signatures, delivery reliability, retry logic, dead-letter queue, failure recovery
federation	Cross-boundary signing, Settlement Signal emission, peer-to-peer Server interoperability
compliance	SOC 2 control mapping (CC1.1, CC6.x, CC7.2, PI1), audit trail completeness, tamper detection
day-2	Vault signing keys, scope profile management, YAML provisioning, operational runbooks

Logic tests (state machines, RBAC matrices, schema validation, cryptographic primitives) live in the AGLedger API repo’s own integration suite. That is where they belong. The testbed never re-runs them. The shape is Fowler’s practical test pyramid: a wide unit base in the API repo, a thin integration band on top, and a deliberately small deployed-infrastructure tip here in the testbed.

Test profiles

Not every change needs the full 36 tests. Named profiles run the relevant subset:

smoke: core lifecycle plus basic auth. Run on every PR.

infra: ALB, WAF, Aurora, TLS behavior.

onboarding: fresh customer install path.

integration: orchestrator, mixed chains, entity references.

soc2: mapped to SOC 2 controls (CC1.1, CC6.x, CC7.2, PI1).

all: 36 tests. ~5 minutes. Full suite on release.

Stress tests with pg_stat_statements

Correctness tests catch bugs. Stress tests catch cost. We run a separate audit-read stress harness (20 concurrent readers for 5 minutes against 2,260 records on Aurora Serverless v2) and record pg_stat_statements for every release. The output is a perf-snapshot directory with before/after query breakdowns and a findings file that flags regressions.

This is how we caught a 6-query aggregation pattern in the enterprise-report endpoint that was burning 70 seconds of database CPU every 5 minutes. The rewrite cut it to 39. We wrote it up, including the honest caveat about why wall-clock p50 didn’t improve the way you’d expect.

Testing with real LLM agents

Four of our tests use real AI models, not simulated agents, not scripted calls. We give Claude Haiku, Gemini Flash, GPT-4o-mini, and Amazon Nova the API tool definitions and a business task. No documentation. No examples. No hand-holding.

The question: can an agent that has never seen AGLedger before discover and complete a full record lifecycle from tool descriptions alone? Research on class-level code generation found 84–89% correctness on synthetic benchmarks vs. 25–34% on real-world classes (arXiv 2510.26130), a synthetic-vs-real gap we observe in API-discovery tasks too. We test the real-world number.

We measure: discovery rate, error recovery, steps to completion, which providers get stuck, and where. Randomized business scenarios: procurement, analysis, coordination, infrastructure. Max 15 tool calls per task before timeout.

This is how we find out whether our API is usable, not whether it is correct. Correctness comes from the API repo’s own suite. Usability comes from watching real agents try.

The findings from those agent runs feed three downstream posts: zero-scaffolding API discovery (can an agent learn the API from /llms.txt alone?), budget LLMs outperform premium (the Doers-vs-Planners data), and A2A v1 learnings (what we changed in the API after watching agents fail at it).

Customer reality tests

The onboarding profile simulates a real customer's first day. Fresh registration. No pre-configured accounts. SDK only (no raw API shortcuts). No rate limit exemptions. Unicode throughout. Error message quality checks.

If the onboarding path is broken, this profile catches it before a customer does.

What the numbers look like

Baseline at publication (v0.19.10, 2026-04-17):

Profile: all (36 tests)

Assertions: 951

Passed: 927

Failed: 7

Skipped: 17

Findings cataloged: F-001 onward (most resolved; open ones tracked to resolution)

We publish these numbers because hiding them defeats the purpose. Most failures are single assertions within a test: an edge case in delegation cascading, a timing issue in webhook retry, a schema validation gap. Each one has a finding number and gets fixed.

Core lifecycle, authentication, and security tests are stable. Delegation chains and federation signing have the most open findings. That is where the complexity lives, and that is where we focus.

Why this matters

If you are evaluating AGLedger, the testbed is how we hold ourselves accountable. The same lifecycle we ask you to use for your agents (structured record, evidence of delivery, verdict) is what we apply to our own software.

Every test hits a real deployment. No mocks.

Every failure is cataloged and tracked. No hiding.

Every finding has a number, a severity, and a resolution path.

Real LLM agents test usability, not just correctness.

Stress tests catch cost regressions, not just bugs.

An accountability engine that cannot account for its own quality is not worth running.

Sources & further reading

NIST SP 800-115 — Technical Guide to Information Security Testing and Assessment

NIST IR 8397 — Guidelines on Minimum Standards for Developer Verification of Software

AICPA SOC 2 — Trust Services Criteria

PostgreSQL pg_stat_statements — Query execution statistics

RFC 8032 — Edwards-Curve Digital Signature Algorithm (Ed25519)

RFC 9052 — CBOR Object Signing and Encryption (COSE), the envelope wrapping each chain entry as of v0.23.0

RFC 8949 — Concise Binary Object Representation (CBOR); §4.2.1 deterministic encoding rules

RFC 8785 — JSON Canonicalization Scheme (JCS), used for federation transport canonicalization

arXiv 2510.26130 — Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Tasks

AWS Aurora PostgreSQL — Best Practices

Martin Fowler — The Practical Test Pyramid

perfPostgreSQL GROUPING SETS: a 44% query-time cut

perfAGLedger performance at scale

agentsBudget LLMs outperform premium on accountability tasks

securitySecurity architecture

docsDocumentation