
2026-04-13

Engineering

150 Tests, Zero Mocks: How AGLedger Tests an Accountability Engine

An accountability system that hides bugs is worse than no accountability system. If the audit trail says PASS when the behavior was wrong, you have a liability, not a feature. That principle — aligned with NIST's minimum standards for developer verification — drives everything about how we test AGLedger.

Summary

AGLedger maintains an independent testbed that runs 150 tests against a live API — real EKS cluster, real Aurora database, real webhook delivery, real LLM agents. No mocks. No stubs. Every test produces a structured JSON result with three possible outcomes: PASS, FAIL, or SKIP. Failures are documented with finding numbers (F-NNN) and tracked to resolution. The testbed has cataloged over 350 findings since inception.

Philosophy

The goal is product improvement, not passing tests.

A test that marks a broken thing as PASS is worse than no test. A test that skips instead of failing hides the problem. We have three rules for test authors:

1. Never mark a broken thing as PASS

2. Never skip-as-PASS — use t.skip() with a reason

3. Assert the specific thing you're testing, not just HTTP 200
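Rule 3 is the easiest to get wrong, so here is a minimal sketch of what it means in practice. The `MandateResponse` shape and the `FULFILLED` state name are illustrative assumptions, not AGLedger's actual schema:

```typescript
// Rule 3 in practice: assert the specific behavior, not transport success.
// The response shape here is a hypothetical illustration.
interface MandateResponse {
  status: number;
  body: { state: string; receiptId?: string };
}

function assertFulfilled(res: MandateResponse): string[] {
  const failures: string[] = [];
  // HTTP 200 alone proves nothing about behavior.
  if (res.status !== 200) failures.push(`expected 200, got ${res.status}`);
  // The specific thing under test: the mandate actually reached FULFILLED.
  if (res.body.state !== "FULFILLED") {
    failures.push(`expected FULFILLED, got ${res.body.state}`);
  }
  // A fulfilled mandate must carry a receipt.
  if (res.body.state === "FULFILLED" && !res.body.receiptId) {
    failures.push("missing receiptId");
  }
  return failures;
}

// A 200 with the wrong state must FAIL, never PASS (rule 1).
const wrong = assertFulfilled({ status: 200, body: { state: "PENDING" } });
console.log(wrong.length === 0 ? "PASS" : `FAIL: ${wrong.join("; ")}`);
```

A test written this way cannot be fooled by a server that returns 200 on every request.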

Every failure gets a finding number (F-001 through F-350+), a severity, expected vs. actual behavior, reproduction steps, and an impact assessment. Findings are tracked in a catalog that any team member can read. 314 have been resolved. The open ones are visible too.
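A finding record might look roughly like the following. The field names are a hedged sketch for illustration, not the catalog's actual schema:

```typescript
// A hypothetical shape for a finding record; field names are illustrative.
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  id: string;           // e.g. "F-042"
  severity: Severity;
  expected: string;     // expected behavior
  actual: string;       // observed behavior
  reproSteps: string[]; // reproduction steps
  impact: string;       // impact assessment
  resolved: boolean;
}

const example: Finding = {
  id: "F-042",
  severity: "medium",
  expected: "webhook retried with exponential backoff",
  actual: "retry fired immediately three times",
  reproSteps: ["create mandate", "fail first delivery", "observe retry timing"],
  impact: "burst load on customer endpoints during outages",
  resolved: true,
};
console.log(`${example.id} [${example.severity}] resolved=${example.resolved}`);
```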

Infrastructure: no mocks

The testbed runs against a live deployment — not a mock server, not a test double, not an in-memory stub. The stack:

Compute — EKS cluster (dedicated testbed namespace)

Database — Aurora Serverless v2, PostgreSQL 17

Networking — ALB Ingress, WAF with IP allowlist

Deployment — Helm chart, same Docker image as production

Workers — Real async verification, settlement, and webhook delivery

Fresh namespace per test run. Ephemeral database. Auto-provisioned keys. Deterministic cleanup. The testbed tests what customers actually run.

What we test

150 tests organized by category. Each test is a standalone executable TypeScript file — no shared state, no fixtures, no implicit ordering.
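The standalone-file contract described above can be sketched in a few lines. The result shape is an assumption for illustration; the three-outcome model (PASS, FAIL, SKIP) is the one the testbed uses:

```typescript
// A minimal sketch of a standalone test executable emitting a structured
// JSON result. The exact shape is a hypothetical illustration.
type Outcome = "PASS" | "FAIL" | "SKIP";

interface TestResult {
  name: string;
  outcome: Outcome;
  assertions: number;
  failures: string[];
  skipReason?: string;
}

function report(
  name: string,
  failures: string[],
  assertions: number,
  skipReason?: string
): TestResult {
  // SKIP requires an explicit reason; otherwise any failure means FAIL.
  const outcome: Outcome = skipReason ? "SKIP" : failures.length ? "FAIL" : "PASS";
  return { name, outcome, assertions, failures, skipReason };
}

// Each file runs on its own: no shared state, no fixtures, no ordering.
console.log(JSON.stringify(report("mandate-lifecycle", [], 12)));
```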

Category — Tests — What it validates

Core lifecycle — 58 — Mandate creation through fulfillment, state transitions, contract types, receipt validation

Security — 31 — Permission boundaries, RBAC matrix, privilege escalation, multi-tenancy isolation, key rotation

Enterprise — 16 — Supplier approval, agent overreach protection, middleware, enterprise audit trails

Audit & compliance — 13 — Trail completeness, tamper detection, event sequencing, EU AI Act tracking

SOC 2 — 12 — Mapped to SOC 2 Trust Services Criteria: CC1.1 (TLS), CC6.1 (access controls), CC6.2 (escalation), CC6.6 (SSRF), CC7.2 (audit), PI1 (integrity)

Webhooks — 11 — HMAC signatures, delivery reliability, retry logic, dead-letter queue, failure recovery

Delegation chains — 7 — Multi-level agent-to-agent delegation, cascading verification, constraint inheritance

Agent DX — 4 — Real LLM agents (Claude, Gemini, GPT, Nova) discover and use the API with zero scaffolding
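The HMAC signature checks in the webhooks category can be illustrated with a constant-time verification sketch. The secret prefix and hex encoding are assumptions for illustration, not AGLedger's documented signing scheme:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// A sketch of constant-time HMAC-SHA256 webhook signature verification.
// The encoding and secret format here are illustrative assumptions.
function verifySignature(payload: string, secret: string, signatureHex: string): boolean {
  const expected = createHmac("sha256", secret).update(payload).digest();
  const given = Buffer.from(signatureHex, "hex");
  // Length check first, then timingSafeEqual to avoid a timing side channel.
  return given.length === expected.length && timingSafeEqual(given, expected);
}

const body = JSON.stringify({ event: "mandate.fulfilled", id: "mdt_123" });
const sig = createHmac("sha256", "whsec_test").update(body).digest("hex");
console.log(verifySignature(body, "whsec_test", sig));              // true
console.log(verifySignature(body, "whsec_test", "00".repeat(32)));  // false
```

Comparing with `timingSafeEqual` rather than `===` is what the test suite would need to assert: a string comparison that short-circuits leaks timing information.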

Test profiles

Not every change needs 150 tests. We run named profiles depending on what changed:

smoke — 5 minutes. Core lifecycle, basic auth. Run on every PR.

core — 58 tests. Full lifecycle coverage.

security — 31 tests. Permissions, RBAC, isolation.

soc2 — 12 tests. Mapped to SOC 2 controls (CC1.1, CC6.x, CC7.2, PI1).

customer-reality — 10 tests. Fresh registration, no exemptions, Unicode, SDK-only.

all — 150 tests. 54 minutes. Full suite.
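A profile is essentially a named filter over the test catalog. As a hedged sketch of how the selection might work (the tag names and wildcard convention are assumptions):

```typescript
// Hypothetical profile-to-selection mapping; tag names are illustrative.
const profiles: Record<string, string[]> = {
  smoke: ["core:basic", "auth:basic"],
  core: ["core"],
  security: ["security"],
  soc2: ["soc2"],
  "customer-reality": ["customer-reality"],
  all: ["*"], // wildcard: run everything
};

function selectTests(profile: string, allTests: string[]): string[] {
  const patterns = profiles[profile] ?? [];
  if (patterns.includes("*")) return allTests;
  // A test is selected if its tag starts with any pattern in the profile.
  return allTests.filter(t => patterns.some(p => t.startsWith(p)));
}

const catalog = ["core:basic:create", "core:receipt", "security:rbac", "soc2:cc6.1"];
console.log(selectTests("security", catalog)); // [ 'security:rbac' ]
```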

Testing with real LLM agents

Four of our tests use real AI models — not simulated agents, not scripted calls. We give Claude Haiku, Gemini Flash, GPT-4o-mini, and Amazon Nova the API tool definitions and a business task. No documentation. No examples. No hand-holding.

The question: can an agent that has never seen AGLedger before discover and complete a full mandate lifecycle from tool descriptions alone? Research shows LLMs score 84–89% on synthetic benchmarks but only 25–34% on real-world tasks. We test the real-world number.

We measure: discovery rate, error recovery, steps to completion, which providers get stuck, and where. 100+ randomized business scenarios per run — procurement, analysis, coordination, infrastructure. Max 15 tool calls per task before timeout.
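The tool-call budget described above can be sketched as a simple loop. The agent interface is hypothetical; a real run would call a model provider's API at each step:

```typescript
// A sketch of the per-task budget: at most 15 tool calls before timeout.
// The Agent type is a hypothetical stand-in for a real LLM call.
interface ToolCall { tool: string; args: unknown }
type Agent = (history: ToolCall[]) => ToolCall | { done: true };

function runTask(agent: Agent, maxCalls = 15): { completed: boolean; calls: number } {
  const history: ToolCall[] = [];
  while (history.length < maxCalls) {
    const next = agent(history);
    if ("done" in next) return { completed: true, calls: history.length };
    history.push(next);
  }
  // Budget exhausted: the agent got stuck, which is itself a measurement.
  return { completed: false, calls: history.length };
}

// An agent that finds its way in three calls vs. one that never finishes.
const quick: Agent = h => (h.length < 3 ? { tool: "create_mandate", args: {} } : { done: true });
const stuck: Agent = () => ({ tool: "list_mandates", args: {} });
console.log(runTask(quick)); // { completed: true, calls: 3 }
console.log(runTask(stuck)); // { completed: false, calls: 15 }
```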

This is how we find out whether our API is usable, not whether it is correct. Correctness comes from the other 146 tests. Usability comes from watching real agents try.

Customer reality tests

The customer-reality profile simulates a real customer's first day. Fresh registration. No pre-configured accounts. SDK only (no raw API shortcuts). No rate limit exemptions. Unicode throughout. Error message quality checks.

If the onboarding path is broken, this profile catches it before a customer does.

What the numbers look like

Recent full run (April 8, 2026):

Profile — all (150 tests)

Duration — 54.5 minutes

Total assertions — 3,558

Passed — 2,920 (82.1%)

Failed — 569 (16.0%)

Skipped — 69 (1.9%)

Findings cataloged — 350+ (314 resolved)
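As a sanity check, the percentages and the implied skip count follow directly from the totals:

```typescript
// Verify the published run numbers are internally consistent.
const total = 3558;
const passed = 2920;
const failed = 569;

// The three-outcome model means everything not passed or failed was skipped.
const skipped = total - passed - failed;
console.log(skipped); // 69

console.log(((passed / total) * 100).toFixed(1)); // "82.1"
console.log(((failed / total) * 100).toFixed(1)); // "16.0"
```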

We publish these numbers because hiding them defeats the purpose. Most failures are single assertions within a test — an edge case in delegation cascading, a timing issue in webhook retry, a schema validation gap. Each one has a finding number and gets fixed.

Core lifecycle, authentication, and security tests are stable. Delegation chains and webhook delivery have the most open findings. That is where the complexity lives, and that is where we focus.

Why this matters

If you are evaluating AGLedger, the testbed is how we hold ourselves accountable. The same lifecycle we ask you to use for your agents — structured commitment, evidence of delivery, verdict — is what we apply to our own software.

Every test hits a real deployment. No mocks.

Every failure is cataloged and tracked. No hiding.

Every finding has a number, a severity, and a resolution path.

Real LLM agents test usability, not just correctness.

An accountability engine that cannot account for its own quality is not worth running.

Sources & further reading

NIST SP 800-115 — Technical Guide to Information Security Testing and Assessment

NIST IR 8397 — Guidelines on Minimum Standards for Developer Verification of Software

AICPA SOC 2 — Trust Services Criteria

RFC 8032 — Edwards-Curve Digital Signature Algorithm (Ed25519)

RFC 8785 — JSON Canonicalization Scheme (JCS)

arXiv 2510.26130 — Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Tasks

AWS Aurora PostgreSQL — Best Practices
