2026-04-13
Engineering
150 Tests, Zero Mocks: How AGLedger Tests an Accountability Engine
An accountability system that hides bugs is worse than no accountability system. If the audit trail says PASS when the behavior was wrong, you have a liability, not a feature. That principle — aligned with NIST's minimum standards for developer verification — drives everything about how we test AGLedger.
Summary
AGLedger maintains an independent testbed that runs 150 tests against a live API — real EKS cluster, real Aurora database, real webhook delivery, real LLM agents. No mocks. No stubs. Every test produces a structured JSON result with three possible outcomes: PASS, FAIL, or SKIP. Failures are documented with finding numbers (F-NNN) and tracked to resolution. The testbed has cataloged over 350 findings since inception.
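To make "structured JSON result" concrete, here is a minimal sketch of what such a record might look like. The field names are illustrative, not AGLedger's actual schema:

```typescript
// Hypothetical shape of one structured test result (field names invented).
type Outcome = "PASS" | "FAIL" | "SKIP";

interface TestResult {
  test: string;          // test file name
  outcome: Outcome;
  assertions: number;    // assertions executed
  failures: string[];    // expected-vs-actual messages, empty on PASS
  finding?: string;      // F-NNN reference when a failure is cataloged
  skipReason?: string;   // required whenever outcome is "SKIP"
}

// Example: a failure tied to a cataloged finding.
const result: TestResult = {
  test: "webhook-retry.ts",
  outcome: "FAIL",
  assertions: 12,
  failures: ["expected 3 retries, observed 2"],
  finding: "F-217",
};
```

A machine-readable record like this is what makes the finding catalog queryable rather than a pile of logs.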
Philosophy
The goal is product improvement, not passing tests.
A test that marks a broken thing as PASS is worse than no test. A test that skips instead of failing hides the problem. We have three rules for test authors:
1. Never mark a broken thing as PASS
2. Never skip-as-PASS — use t.skip() with a reason
3. Assert the specific thing you're testing, not just HTTP 200
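A toy harness can illustrate how rules 2 and 3 get enforced mechanically. The names here are ours, not the actual testbed API: a skip must carry a reason and is recorded as SKIP, never folded into PASS, and assertions name the specific expectation rather than "request succeeded":

```typescript
type Outcome = "PASS" | "FAIL" | "SKIP";

// Hypothetical test context enforcing the three rules above.
class T {
  outcome: Outcome = "PASS";
  notes: string[] = [];

  // Rule 2: skipping requires an explicit reason and never reports PASS.
  skip(reason: string): void {
    if (!reason.trim()) throw new Error("t.skip() requires a reason");
    this.outcome = "SKIP";
    this.notes.push(`skipped: ${reason}`);
  }

  // Rule 3: assert the specific expectation, not just an HTTP 200.
  assertEqual<V>(actual: V, expected: V, what: string): void {
    if (actual !== expected) {
      this.outcome = "FAIL";
      this.notes.push(`${what}: expected ${expected}, got ${actual}`);
    }
  }
}

const t = new T();
t.assertEqual(200, 200, "status code");
t.assertEqual("settled", "settled", "mandate state"); // the specific thing under test
```

Because FAIL is sticky and SKIP is never silently promoted, a run's summary cannot overstate health.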
Every failure gets a finding number (F-001 through F-350+), a severity, expected vs. actual behavior, reproduction steps, and an impact assessment. Findings are tracked in a catalog that any team member can read. 314 have been resolved; the open ones are visible too.
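A finding record carries exactly the fields listed above. A hypothetical sketch of one entry (the structure and example content are illustrative):

```typescript
// Hypothetical finding record as it might appear in the catalog.
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  id: string;             // F-NNN
  severity: Severity;
  expected: string;       // documented behavior
  actual: string;         // observed behavior
  reproduction: string[]; // steps to reproduce
  impact: string;
  status: "open" | "resolved";
}

const f217: Finding = {
  id: "F-217",
  severity: "medium",
  expected: "webhook retried 3 times with backoff",
  actual: "second retry never fired after a 5xx",
  reproduction: ["create mandate", "register failing endpoint", "trigger settlement"],
  impact: "delivery guarantees weaker than documented",
  status: "resolved",
};
```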
Infrastructure: no mocks
The testbed runs against a live deployment — not a mock server, not a test double, not an in-memory stub. The stack:
Compute — EKS cluster (dedicated testbed namespace)
Database — Aurora Serverless v2, PostgreSQL 17
Networking — ALB Ingress, WAF with IP allowlist
Deployment — Helm chart, same Docker image as production
Workers — Real async verification, settlement, and webhook delivery
Fresh namespace per test run. Ephemeral database. Auto-provisioned keys. Deterministic cleanup. The testbed tests what customers actually run.
What we test
150 tests organized by category. Each test is a standalone executable TypeScript file — no shared state, no fixtures, no implicit ordering.
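"Standalone executable" means each file carries everything it needs and asserts a specific behavior. A toy example in the spirit of the core lifecycle tests, checking a state transition directly (the state names and transition table are invented for illustration, not AGLedger's actual state machine):

```typescript
// Hypothetical mandate state machine used to check transitions.
const transitions: Record<string, string[]> = {
  created:   ["committed"],
  committed: ["fulfilled", "breached"],
  fulfilled: ["settled"],
  breached:  [],
  settled:   [],
};

function canTransition(from: string, to: string): boolean {
  return (transitions[from] ?? []).includes(to);
}

// A standalone test asserts the specific transition, not just "no error".
console.assert(canTransition("committed", "fulfilled"));
console.assert(!canTransition("settled", "created")); // terminal states stay terminal
```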
| Category | Tests | What it validates |
|---|---|---|
| Core lifecycle | 58 | Mandate creation through fulfillment, state transitions, contract types, receipt validation |
| Security | 31 | Permission boundaries, RBAC matrix, privilege escalation, multi-tenancy isolation, key rotation |
| Enterprise | 16 | Supplier approval, agent overreach protection, middleware, enterprise audit trails |
| Audit & compliance | 13 | Trail completeness, tamper detection, event sequencing, EU AI Act tracking |
| SOC 2 | 12 | Mapped to SOC 2 Trust Services Criteria: CC1.1 (TLS), CC6.1 (access controls), CC6.2 (escalation), CC6.6 (SSRF), CC7.2 (audit), PI1 (integrity) |
| Webhooks | 11 | HMAC signatures, delivery reliability, retry logic, dead-letter queue, failure recovery |
| Delegation chains | 7 | Multi-level agent-to-agent delegation, cascading verification, constraint inheritance |
| Agent DX | 4 | Real LLM agents (Claude, Gemini, GPT, Nova) discover and use the API with zero scaffolding |
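The webhook tests verify HMAC signatures end to end. A minimal sketch of the verification side, assuming a SHA-256 HMAC over the raw body in hex (the secret format and signing scheme here are illustrative, not AGLedger's actual wire format):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute the expected signature for a raw webhook body.
function sign(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Constant-time comparison to avoid timing side channels.
function verify(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(sign(secret, body), "hex");
  const given = Buffer.from(signature, "hex");
  return expected.length === given.length && timingSafeEqual(expected, given);
}

const body = JSON.stringify({ event: "mandate.settled", id: "m_123" });
const sig = sign("whsec_test", body);
```

The tests assert not only that a valid signature verifies, but that a tampered body or wrong secret fails.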
Test profiles
Not every change needs 150 tests. We run named profiles depending on what changed:
smoke — 5 minutes. Core lifecycle, basic auth. Run on every PR.
core — 58 tests. Full lifecycle coverage.
security — 31 tests. Permissions, RBAC, isolation.
soc2 — 12 tests. Mapped to SOC 2 controls (CC1.1, CC6.x, CC7.2, PI1).
customer-reality — 10 tests. Fresh registration, no exemptions, Unicode, SDK-only.
all — 150 tests. 54 minutes. Full suite.
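A profile can be thought of as a named subset of the suite. A hypothetical registry mapping profile names to the categories they pull in (the category names and the budgets for core and security are illustrative; the smoke and all budgets come from the list above):

```typescript
// Hypothetical profile registry; names and budgets are illustrative.
interface Profile {
  name: string;
  categories: string[];
  budgetMinutes: number; // rough wall-clock budget
}

const profiles: Record<string, Profile> = {
  smoke:    { name: "smoke", categories: ["core-basic", "auth"], budgetMinutes: 5 },
  core:     { name: "core", categories: ["core"], budgetMinutes: 20 },
  security: { name: "security", categories: ["security"], budgetMinutes: 15 },
  all:      { name: "all", categories: ["*"], budgetMinutes: 54 },
};

function selectProfile(name: string): Profile {
  const p = profiles[name];
  if (!p) throw new Error(`unknown profile: ${name}`);
  return p;
}
```

Failing loudly on an unknown profile matters: a typo that silently runs zero tests would be another skip-as-PASS.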
Testing with real LLM agents
Four of our tests use real AI models — not simulated agents, not scripted calls. We give Claude Haiku, Gemini Flash, GPT-4o-mini, and Amazon Nova the API tool definitions and a business task. No documentation. No examples. No hand-holding.
The question: can an agent that has never seen AGLedger before discover and complete a full mandate lifecycle from tool descriptions alone? Research shows LLMs score 84–89% on synthetic benchmarks but only 25–34% on real-world tasks. We test the real-world number.
We measure: discovery rate, error recovery, steps to completion, which providers get stuck, and where. 100+ randomized business scenarios per run — procurement, analysis, coordination, infrastructure. Max 15 tool calls per task before timeout.
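The harness caps each task at 15 tool calls before declaring a timeout. A simplified sketch of such a loop, where `agentStep` stands in for a real LLM call and everything else is illustrative:

```typescript
// Hypothetical agent-task runner with a hard cap on tool calls.
type ToolCall = { tool: string; args: unknown };
type StepResult = { call?: ToolCall; done: boolean };

const MAX_TOOL_CALLS = 15;

function runTask(
  agentStep: (history: ToolCall[]) => StepResult
): { completed: boolean; calls: number } {
  const history: ToolCall[] = [];
  for (let i = 0; i < MAX_TOOL_CALLS; i++) {
    const step = agentStep(history);
    if (step.done) return { completed: true, calls: history.length };
    if (step.call) history.push(step.call);
  }
  // Cap reached without completion: recorded as a timeout, not a pass.
  return { completed: false, calls: history.length };
}

// A scripted stand-in agent that finishes after three tool calls.
const demo = runTask((h) =>
  h.length < 3 ? { call: { tool: "createMandate", args: {} }, done: false } : { done: true }
);
```

Recording the call count per task is what lets the testbed compare steps-to-completion across providers.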
This is how we find out whether our API is usable, not whether it is correct. Correctness comes from the other 146 tests. Usability comes from watching real agents try.
Customer reality tests
The customer-reality profile simulates a real customer's first day. Fresh registration. No pre-configured accounts. SDK only (no raw API shortcuts). No rate limit exemptions. Unicode throughout. Error message quality checks.
If the onboarding path is broken, this profile catches it before a customer does.
What the numbers look like
Recent full run (April 8, 2026):
Profile — all (150 tests)
Duration — 54.5 minutes
Total assertions — 3,558
Passed — 2,920 (82.1%)
Failed — 569 (16.0%)
Skipped — 69 (1.9%)
Findings cataloged — 350+ (314 resolved)
We publish these numbers because hiding them defeats the purpose. Most failures are single assertions within a test — an edge case in delegation cascading, a timing issue in webhook retry, a schema validation gap. Each one has a finding number and gets fixed.
Core lifecycle, authentication, and security tests are stable. Delegation chains and webhook delivery have the most open findings. That is where the complexity lives, and that is where we focus.
Why this matters
If you are evaluating AGLedger, the testbed is how we hold ourselves accountable. The same lifecycle we ask you to use for your agents — structured commitment, evidence of delivery, verdict — is what we apply to our own software.
Every test hits a real deployment. No mocks.
Every failure is cataloged and tracked. No hiding.
Every finding has a number, a severity, and a resolution path.
Real LLM agents test usability, not just correctness.
An accountability engine that cannot account for its own quality is not worth running.
Sources & further reading
NIST SP 800-115 — Technical Guide to Information Security Testing and Assessment
NIST IR 8397 — Guidelines on Minimum Standards for Developer Verification of Software
AICPA SOC 2 — Trust Services Criteria
RFC 8032 — Edwards-Curve Digital Signature Algorithm (Ed25519)
RFC 8785 — JSON Canonicalization Scheme (JCS)
arXiv 2510.26130 — Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Tasks
AWS Aurora PostgreSQL — Best Practices