
2026-04-15

Research

Frictionless Compliance: How Accountability Gates Guide AI Agents to Success

AGLedger + Microsoft's Agent Governance Toolkit. 70 traces, 4 LLM providers, 7 contract types. Policy denials break agents. Directives fix them.

The problem with “deny”

Microsoft's Agent Governance Toolkit, when integrated into your agent pipeline, intercepts every tool call and evaluates it against policy. If the agent tries to execute a funds transfer without authorization, the toolkit blocks it.

But “blocked” is where most governance stories end. The agent gets a denial. It retries. It gets denied again. It halts, or worse, it fabricates credentials. The compliance team gets their audit log entry. The agent gets nothing done.

We collected 70 traces across four LLM providers, seven contract types, and three integration patterns. A bare policy denial — the kind that says “Conditions not met” with no recovery guidance — causes three distinct failure patterns:

The Fabricator

GPT-4o-mini (N=4), denied a funds transfer, generated 25 random UUIDs as mandate IDs in a single session. It never created a real mandate. It burned through its entire step budget guessing.

The Quitter

Claude Haiku (N=7 contract types, weak prompt): in four of the seven types, the agent attempted one accountability command, got an error, and stopped trying. The domain task completed. The accountability trail was empty.

The Loop

Amazon Nova Pro (N=1) invented a contract type called FUND-TRANSFER-v1 and tried it 13 times, removing one field from the payload each attempt. It never ran schema list to discover the real types.

These are not hypothetical. These are traces from real LLM agents running against a live API.

What if “deny” came with directions?

We modified the gate. Instead of returning "Conditional permission denied for transfer_funds: Conditions not met", we returned this:

```
REJECTED: This action requires an accountability mandate.
Create a mandate of type ACH-TXN-v1 using the accountability API
at https://your-agledger-instance.com.
Include the mandate ID when retrying this action.
```

Same policy rule. Same agent. Same system prompt (which says nothing about AGLedger or accountability). The only configuration change: AGLedger tools added to the agent's tool palette, and the rejection message told the agent what to do next.

The agent read the rejection. Called the AGLedger API. Discovered the schema. Created a mandate. Retried the transfer with a real mandate ID. The gate validated it server-side. The transfer went through. The agent submitted a receipt documenting what it did. Verification passed. Status: FULFILLED.

No system prompt changes. No agent behavior code changes. You add AGLedger tools (CLI, MCP server, or HTTP) to the agent's tool set, deploy a policy file, and the gate rejection teaches the agent the rest.

The data

70 total traces across three integration patterns, seven contract types, four LLM providers, and two integration modes (CLI and MCP). Each finding below is from a specific scenario — sample sizes noted.

Tool description improvements (what agents see before their first call)

| Scenario | Before | After | Change |
|---|---|---|---|
| MCP receipt submission (Haiku, single scenario) | 14 API calls, 12,122 tokens | 2 API calls, 4,359 tokens | 64% fewer tokens |
| MCP receipt submission (Gemini, single scenario) | 6 calls, ACTIVE (failed) | 4 calls, FULFILLED | Failure to success |
| CLI weak prompt (COMM, Haiku, single scenario) | 1 command, zero receipts | 10 commands, FULFILLED | Dead to working |

Policy-gated pattern (policy file + directive, zero AGLedger in system prompt)

| Integration | Contract | Tokens | Gate passed? | Full lifecycle? |
|---|---|---|---|---|
| MCP / Haiku | ACH-TXN-v1 | 10,771 | Yes (real mandate) | FULFILLED |
| CLI / Haiku | ACH-TXN-v1 | 10,861 | Yes (real mandate) | Mandate created |
| CLI / Haiku | ACH-COMM-v1 | 4,994 | Yes | Mandate created |

Provider behaviors (from all traces per provider)

| Provider | Self-corrects? | N | Key behavior |
|---|---|---|---|
| Claude Haiku (Anthropic) | Yes | 53 | Reads error messages. Self-corrects in 1–2 retries. Best error recovery. |
| Gemini Flash (Google) | Yes | 6 | Most token-efficient. Independently tries GET /llms.txt for API docs. |
| GPT-4o-mini (OpenAI) | No | 4 | Fabricates IDs. Ignores structured errors. Infinite retry loops. |
| Nova Pro (Amazon) | No | 1 | Invents contract types. Never discovers real schemas. |

GPT-4o-mini and Nova Pro results are from small samples — your mileage may vary with different model versions. The behavioral patterns (fabrication, inability to discover schemas) were consistent across all runs we observed.

Why allow/deny is not enough

The Agent Governance Toolkit does policy enforcement. It answers: “Is this agent allowed to do this?” That's necessary. But it's not sufficient for accountability.

Accountability answers different questions: What was the agent supposed to do? What did it actually do? Does the evidence match the commitment? Who accepted the work?

A policy gate can block an unauthorized transfer. It cannot prove that a $45,000 tax payment was made to the correct payee, within budget, before the deadline, with CFO approval documented in a tamper-evident audit trail.

AGLedger adds the accountability layer underneath the governance layer:

```
Agent calls transfer_funds
  → Agent Governance Toolkit: policy check (is this allowed?)
  → AGLedger gate: mandate check (is this accountable?)
      → If no mandate: reject with directive
      → Agent creates mandate (commitment recorded)
      → Agent retries (gate validates mandate ID server-side)
  → Transfer executes
  → Agent submits receipt (delivery documented)
  → Tolerance checking runs (structure and bounds checked against commitment)
  → FULFILLED (audit trail complete)
```

The governance toolkit and AGLedger are complementary. The toolkit controls what agents can do. AGLedger records what they committed to do, what they actually did, and whether the two matched.
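The two-layer decision can be sketched as a single gate function. This is an illustrative sketch, not the toolkit's real API: the callables `policy_allows` and `mandate_exists` are assumed stand-ins for the toolkit's policy evaluation and the AGLedger server-side lookup.

```python
# Illustrative two-layer gate: governance check first, accountability second.
# The callable names and directive text are assumptions for this sketch.

DIRECTIVE = (
    "REJECTED: This action requires an accountability mandate. "
    "Create a mandate of type ACH-TXN-v1 using the accountability API "
    "at https://your-agledger-instance.com. "
    "Include the mandate ID when retrying this action."
)

def gate(tool_name, args, policy_allows, mandate_exists):
    """Return (allowed, message) for a tool call.

    policy_allows(tool, args) -> bool   # governance layer
    mandate_exists(mandate_id) -> bool  # accountability layer (server-side)
    """
    if not policy_allows(tool_name, args):
        return False, f"Policy denied {tool_name}"
    mandate_id = args.get("mandateId")
    if not mandate_id or not mandate_exists(mandate_id):
        # Deny, but tell the agent how to recover.
        return False, DIRECTIVE
    return True, "OK"
```

The key design choice is that the denial path returns a directive rather than a bare error, so the same response that blocks the action also teaches the recovery workflow.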

The critical discovery: validate server-side

Our first policy-gated MCP test: the agent was denied a funds transfer, saw our directive, and immediately retried with mandateId: "IRS-EIN-000000000".

It pulled a string from another tool's response and stuffed it into the mandate ID field. The presence check passed. The transfer went through. No mandate was ever created.

After we added server-side validation (the gate calls GET /v1/mandates/{id} and rejects on 404), the agent was forced to actually create a mandate. The next run went to FULFILLED.

This is not optional. Agents will fill in any string that satisfies a parameter requirement. If your gate only checks for presence, agents will fabricate. Always validate mandate IDs against the AGLedger API. This adds ~50–200ms per gated call (a single GET request) — negligible compared to the LLM inference time.
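The difference between the two checks fits in a few lines. In this minimal sketch, `KNOWN_MANDATES` is an assumed stand-in for the GET /v1/mandates/{id} lookup against the AGLedger API:

```python
# Presence-only vs. server-side validation. KNOWN_MANDATES and "mnd_7f3a"
# are illustrative stand-ins for real AGLedger mandate lookups.

KNOWN_MANDATES = {"mnd_7f3a"}

def presence_check(args):
    # Naive gate: any non-empty string satisfies the parameter requirement.
    return bool(args.get("mandateId"))

def validated_check(args):
    # Server-side gate: the ID must resolve to a real mandate (404 -> reject).
    return args.get("mandateId") in KNOWN_MANDATES
```

A fabricated ID like "IRS-EIN-000000000" passes `presence_check` but fails `validated_check`, which is exactly the gap the first MCP test exposed.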

How it works in practice

The integration requires two pieces: a policy configuration that defines which tools need mandates, and a thin adapter that bridges the toolkit's policy evaluation with AGLedger's directive response.

The policy configuration — in our recommended format, which you wire into the toolkit via the adapter:

```yaml
# AGLedger policy configuration (read by the adapter, not by the toolkit directly)
version: "1.0"
defaultAction: allow

agledger:
  apiUrl: https://your-agledger-instance.com
  agentKey: ${AGENT_KEY}

rules:
  - name: funds-transfer-mandate
    tools:
      - transfer_funds
    require: mandateId
    contractType: ACH-TXN-v1
    validate: true
    directive: >-
      REJECTED: This action requires an accountability mandate.
      Create a mandate of type ACH-TXN-v1 using the accountability
      API at https://your-agledger-instance.com.
      Include the mandate ID when retrying this action.
```

The adapter reads this configuration, registers ConditionalPermission rules in the toolkit's PolicyEngine, and enriches denial responses with the AGLedger directive. See our integration guide for the complete adapter code.
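The denial-enrichment step of that adapter can be sketched in a few lines. The rule dict below mirrors the YAML above; the ConditionalPermission registration itself is omitted here (the integration guide has the complete adapter), so this only shows how a bare denial becomes a directive:

```python
# Hypothetical sketch of the adapter's denial enrichment. The rule data
# mirrors the policy YAML; registration with the toolkit is not shown.

RULES = [
    {
        "name": "funds-transfer-mandate",
        "tools": ["transfer_funds"],
        "directive": (
            "REJECTED: This action requires an accountability mandate. "
            "Create a mandate of type ACH-TXN-v1 using the accountability "
            "API at https://your-agledger-instance.com. "
            "Include the mandate ID when retrying this action."
        ),
    }
]

def enrich_denial(tool_name, denial_message):
    """Swap a bare toolkit denial for the matching rule's recovery directive."""
    for rule in RULES:
        if tool_name in rule["tools"]:
            return rule["directive"]
    return denial_message
```

Tools without a matching rule keep the toolkit's original denial message unchanged.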

No system prompt changes. The agent's tool descriptions include AGLedger workflow guidance (what to call, in what order), and the gate directive tells the agent why it was rejected and how to proceed.

What agents actually do (from our traces)

When a Haiku agent hits the gate for the first time:

1. Calls transfer_funds without mandate ID

2. Gets REJECTED: Create a mandate of type ACH-TXN-v1...

3. Looks at its available tools, finds AGLedger CLI or MCP tools

4. Runs schema list to discover contract types

5. Runs schema get ACH-TXN-v1 to learn the required fields

6. Creates a mandate with proper criteria

7. Retries transfer_funds with the real mandate ID

8. Gate validates against AGLedger API, passes

9. Transfer completes

10. Agent submits a receipt documenting the transfer

11. Tolerance checking runs against the mandate's numeric bounds

12. Status: FULFILLED
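The control flow the agent follows above reduces to a call-recover-retry loop. In this sketch, `call_tool` and `create_mandate` are assumed callables: the latter stands in for the discovery and mandate-creation steps (schema list, schema get, create) and returns the new mandate ID:

```python
# Sketch of the recovery loop observed in the traces. call_tool and
# create_mandate are assumed callables, not real AGLedger client functions.

def run_gated_call(call_tool, create_mandate, args):
    """First attempt, then one mandate-backed retry on a REJECTED directive."""
    result = call_tool(args)
    if isinstance(result, str) and result.startswith("REJECTED"):
        # The directive told the agent what to do: create a mandate, retry.
        args = {**args, "mandateId": create_mandate()}
        result = call_tool(args)
    return result
```

In the real traces this loop is driven by the LLM itself, reasoning over the rejection text rather than executing fixed code; the sketch only captures the shape of the recovery.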

Total additional overhead: ~8–12 API calls, ~10K tokens. The accountability trail is complete, tamper-evident, and independently verifiable.

The agent's tool descriptions serve as the only AGLedger guidance it receives — the system prompt remains unchanged. The gate rejection is the teaching moment. The tool descriptions provide the method. The API responses provide the specifics. The agent completes the entire accountability lifecycle without any prior knowledge of AGLedger.

Production considerations

Fail-open vs fail-closed. If the AGLedger API is unreachable during mandate validation, the gate must choose: block the agent (fail-closed, safer) or allow the action (fail-open, more available). Our reference implementation fails closed. Configure based on your risk tolerance.
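The choice fits in one function. This sketch assumes the validator raises ConnectionError when the AGLedger API is unreachable; the real error type depends on your HTTP client:

```python
# Fail-open vs. fail-closed, assuming the validator raises ConnectionError
# when the AGLedger API is unreachable.

def gate_allows(validate, mandate_id, fail_open=False):
    try:
        return validate(mandate_id)   # normal path: server-side check
    except ConnectionError:
        return fail_open              # API down: configuration decides
```

With the default `fail_open=False` the gate blocks when it cannot validate, matching the fail-closed behavior of the reference implementation.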

Token overhead. The first gated call in a session costs ~10K additional tokens as the agent discovers AGLedger. Subsequent gated calls in the same session reuse the learned workflow — overhead drops to ~2–3K tokens (mandate create + receipt submit).

Multiple gated tools. If your agent calls transfer_funds (ACH-TXN-v1) and create_purchase_order (ACH-PROC-v1) in one task, each gated tool requires its own mandate with the appropriate contract type. The agent learns this naturally from the gate directives.

Try it

AGLedger is self-hosted. Deploy with Docker Compose or Helm. The accountability API, the CLI, and the MCP server are all available.

See our integration guide for the complete adapter code, policy configuration, and step-by-step setup.

70 traces collected against AGLedger v0.19.6. Providers tested: Claude Haiku via Bedrock (N=53), Gemini 2.5 Flash (N=6), GPT-4o-mini (N=4), Amazon Nova Pro (N=1).

Sources & further reading

Microsoft Agent Governance Toolkit — Policy enforcement for AI agent pipelines

AGLedger Integration Guide — Complete adapter code, policy configuration, and setup

Error Messages Are the New Prompts — karigor.ai on directive-driven agent recovery

A2A Protocol Specification — Agent-to-Agent Protocol v1.0, Linux Foundation

Model Context Protocol — Anthropic's open protocol for LLM tool integration
