Designing APIs for AI Agents: Lessons from 3 LLM Providers
We ran 27 experiments across Claude Haiku, GPT-4o-mini, and Gemini 2.5 Flash — testing how many tools agents can handle, which wording makes them complete work, and whether MCP wrappers help or hurt. The results changed how we design our API.
The setup
AGLedger's accountability protocol has a simple lifecycle: a principal assigns a mandate, an agent delivers a receipt, and the principal renders a verdict. The question we wanted to answer was practical: can budget-tier LLMs actually complete this lifecycle through an API, without scaffolding?
We gave three LLM providers the same collaborative task and the contract spec as their coordination tool. No scripted outcomes — agents chose how and whether to use mandates. We varied the tool count, the tool descriptions, and the interface layer (MCP wrapper vs. raw HTTP) across runs.
Three providers tested: Claude Haiku (Bedrock), GPT-4o-mini, and Gemini 2.5 Flash. Premium variants (Sonnet, GPT-4o, Gemini Pro) were also tested for comparison, but budget models were the primary target — if the API works for the cheapest models, it works for everything.
Finding 1: 36 tools → 0% receipt completion
Our initial MCP profile exposed 31 AGLedger tools plus 5 dispatcher tools — 36 total. The result: receipt submission dropped to 0% across all three providers. Not low. Zero.
GPT-4o-mini's engagement did improve (27 contract calls vs. 0 in earlier runs), but no agent closed the loop with evidence submission and settlement. The root cause was clear: when MCP tools coexist with a simpler dispatcher, agents treat complete_task as sufficient closure. Why write a structured receipt when you can just mark the task done?
36 tools overwhelm agents. They default to the path of least resistance.
Finding 2: 10 focused tools → full lifecycle closure
We cut the profile to 10 tools purpose-built for the A2A coordination pattern: propose, accept, submit receipt, search, and a few supporting operations. We also added agent-scoped API keys, auto-activation after accept, and a self-assign guard.
The result: 6 proposals, 6 accepts, 8 receipts, 7/7 settlements. All three providers engaged. The same agents that produced 0 receipts with 36 tools achieved full lifecycle closure with 10 focused tools.
In the best configuration (auto-verify + focused A2A profile), we measured 43 receipts with 0 errors and 0 dispatcher calls — a 717% receipt ratio (receipts per mandate). Agents were over-documenting, submitting multiple receipts per mandate because the interface made it easy and the descriptions made it clear.
| Configuration | Receipt ratio |
|---|---|
| 36 MCP tools + dispatcher | 0% |
| 10 focused A2A tools | 117% |
| 10 tools + auto-verify | 717% |
| No dispatcher (accountability only) | 933% |
Finding 3: One word changed receipt ratio from 20% to 640%
This was the most surprising result. We tested two versions of the same tool description for submit_receipt:
Version A — 20% receipt ratio
“call submit_receipt (auto-settles)”
Version B — 640% receipt ratio
“submit_receipt — this is REQUIRED”
The word “REQUIRED” is load-bearing. When the description mentioned “auto-settles,” agents interpreted that as “optional” — the system will handle it, so I don't need to. When the description said “REQUIRED,” agents complied.
The lesson: tool descriptions are prompt engineering for agent behavior. If you tell agents something auto-settles, they skip it. If you tell them it's required, they do it. The auto-settle mechanism should be invisible in the tool description. The messaging should say submit_receipt (REQUIRED) without mentioning what happens after.
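To make the wording difference concrete, here is a sketch of the two submit_receipt definitions in function-calling style. Only the quoted description phrases come from our runs; everything else in these dicts is illustrative.

```python
# Two versions of the same tool. Only the description differs,
# yet agent compliance differed by 32x (20% vs 640% receipt ratio).
version_a = {
    "name": "submit_receipt",
    # Mentioning the convenience feature reads as "optional" to agents.
    "description": "call submit_receipt (auto-settles)",
}

version_b = {
    "name": "submit_receipt",
    # Stating the obligation reads as an instruction agents follow.
    "description": "submit_receipt — this is REQUIRED",
}

def is_framed_as_required(tool: dict) -> bool:
    """Heuristic lint: pass only descriptions that state the obligation
    without surfacing internal automation."""
    desc = tool["description"].lower()
    return "required" in desc and "auto" not in desc
```

A lint like this is crude, but running it over a tool profile catches exactly the "auto-settles" framing that suppressed receipts in our runs.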
Finding 4: Raw HTTP outperformed MCP wrappers
When we tested the MCP wrapper, agents made 28 contract calls with 0 receipts, and GPT-4o-mini made no calls at all. When we switched to the raw HTTP API as the only tool, agents made 41 calls with 0 errors, and all three providers engaged.
In zero-scaffolding tests (agent starts with nothing but an HTTP tool and a URL), the HTTP + /llms.txt path achieved 100% lifecycle completion. The agent discovered the API from scratch, created a mandate, submitted a receipt, and reached FULFILLED in ~6 HTTP calls and 24 seconds.
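The discovery step can be sketched as follows. This assumes a hypothetical llms.txt layout of "METHOD /path - description" lines; real llms.txt files are free-form markdown aimed at LLM consumption, and AGLedger's actual file may differ.

```python
import re

def parse_llms_txt(text: str) -> list[dict]:
    """Parse 'METHOD /path - description' lines into candidate endpoints.

    The line format is an assumption for illustration, not a spec:
    an agent with only an http_request tool would read this file and
    pick the endpoint whose description matches its current goal.
    """
    endpoints = []
    for line in text.splitlines():
        m = re.match(r"^\s*(GET|POST|PATCH|DELETE)\s+(/\S+)\s*-\s*(.+)$", line)
        if m:
            endpoints.append({"method": m.group(1), "path": m.group(2), "doc": m.group(3)})
    return endpoints

sample = """\
POST /mandates - create a mandate (REQUIRED before any work)
POST /mandates/{id}/receipts - submit a receipt (REQUIRED on completion)
GET /mandates/{id} - check mandate status
"""

eps = parse_llms_txt(sample)
```

Note that the descriptions double as instructions: the REQUIRED framing from Finding 3 applies to discovery docs just as much as to tool schemas.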
MCP with 11 tools achieved receipt submission but got stuck at ACTIVE — async verification completed after the agent checked status. The SDK path achieved 0% mandate creation due to field name mismatches and schema strictness — the same issues HTTP agents recovered from by retrying with the exact fields named in the error response.
| Interface | Mandate creation | Lifecycle completion |
|---|---|---|
| HTTP + /llms.txt | 100% | 100% |
| MCP (11 tools) | 100% | 0%* |
| SDK (typed tools) | 0% | 0% |
* MCP submitted receipts correctly but checked status before async verification completed — functional gap, not a protocol failure.
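The self-correction behavior that let HTTP agents succeed where the SDK path failed can be sketched as: read the expected field names out of the 400 body and resend. The error body shape below is hypothetical, not AGLedger's actual response format.

```python
def repair_payload(payload: dict, error_body: dict) -> dict:
    """Remap payload keys to the field names a validation error reports.

    Assumes a hypothetical error body like:
      {"error": "validation_failed",
       "expected_fields": {"evidence": "evidence_url"}}
    where keys are the names the agent sent and values are the names
    the API actually requires. Unmapped keys pass through unchanged.
    """
    aliases = error_body.get("expected_fields", {})
    return {aliases.get(key, key): value for key, value in payload.items()}

# First attempt used the wrong field name; the 400 body names the right one.
sent = {"evidence": "https://example.com/artifact", "status": "done"}
err = {"error": "validation_failed", "expected_fields": {"evidence": "evidence_url"}}
retry = repair_payload(sent, err)
```

An SDK with strict typed signatures can't do this loop — the call fails at the client before the server ever gets a chance to teach.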
Finding 5: Provider behavior diverges more than you'd expect
In the initial MCP experiments, Claude Haiku made 11 contract calls, Gemini Flash made 12, and GPT-4o-mini made 0. GPT did all its work through the dispatcher without ever creating a mandate. If the accountability spec is voluntary, some agents won't use it.
GPT-4o-mini also showed a persistent capability gap: across runs 19–23, it consistently omitted the evidence field when calling submit_receipt. 38 out of 39 attempts failed in Run 23. It struggles to construct structured objects from schema descriptions.
Once we moved to the focused 10-tool profile, all three providers engaged. In overnight validation runs (v0.10.10, March 29), the final numbers were:
| Provider | Lifecycle | Receipt | Tokens |
|---|---|---|---|
| Gemini 2.5 Flash | 5/5 (100%) | 5/5 (100%) | 69K |
| GPT-4o-mini | 5/5 (100%) | 5/5 (100%) | 70K |
| Bedrock Haiku | 5/5 (100%) | 5/5 (100%) | 123K |
What we changed
These experiments directly shaped our API design. Here's what we shipped based on the findings:
1. Fewer tools, not more. Our MCP profile went from 36 tools to 10. Every tool we removed improved completion rates. The right question isn't “what can the agent do” — it's “what does the agent need to do right now.”
2. Tool descriptions are prompts. We rewrote every tool description as if it were a prompt instruction. No convenience features mentioned. No “auto” anything. Every required action is labeled REQUIRED. The description tells the agent what to do, not how the system works internally.
3. The API is the coordination mechanism. We stopped treating the API as a parallel system alongside a dispatcher. When the accountability API is the only way to coordinate, agents use it. When it competes with a simpler alternative, agents take the shortcut. Receipt ratio went from 0% (with dispatcher) to 933% (without dispatcher).
4. HTTP-first, always. Raw HTTP with /llms.txt for discoverability is our primary integration path. MCP and SDK are convenience layers. The API needs to be good enough that an agent with nothing but http_request and a URL can complete the full lifecycle.
5. Error messages are onboarding. All 15 agents in our overnight run said the same thing: they learned the API from error messages. We added schema URLs to validation errors, accepted field name aliases, and included evidence examples in schema responses. Every 400 response is now a tutorial.
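A sketch of a “tutorial” 400 body, under assumed field names and URLs (this is not AGLedger's actual response shape): include the schema URL, the accepted aliases, and a worked example so the agent's next call can succeed.

```python
def validation_error(missing: list[str], schema_url: str) -> dict:
    """Build a 400 response body that teaches the agent what to send next.

    The schema URL, alias map, and example payload are illustrative.
    The principle: every field the agent got wrong should come back
    with enough context to get it right on the retry.
    """
    return {
        "error": "validation_failed",
        "missing_fields": missing,
        "schema": schema_url,  # the agent can GET this for the full schema
        "aliases": {"proof": "evidence", "result": "evidence"},  # accepted synonyms
        "example": {
            "mandate_id": "mnd_123",
            "evidence": {"type": "url", "value": "https://example.com/artifact"},
        },
    }

body = validation_error(["evidence"], "https://api.example.com/schemas/receipt")
```

The evidence example in the body is what closed GPT-4o-mini's structured-object gap from Finding 5: it could copy a shape it couldn't construct from a schema description alone.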
6. Design for budget models. If Claude Haiku, GPT-4o-mini, and Gemini Flash can complete the lifecycle, every model above them can too. Budget models exposed every friction point — confusing field names, missing examples, structured object construction failures. Fix it for the cheapest model and you fix it for everyone.
The bigger picture
The conventional wisdom in agent tooling is “give agents more capabilities.” Our data says the opposite. Agents perform better with fewer, clearer tools. The interface shapes behavior more than the model does.
This has implications beyond AGLedger. If you're building an API that agents will consume, the design principles are different from human-facing APIs. Agents don't read documentation pages — they read tool descriptions and error messages. They don't browse options — they pick the first tool that looks relevant. They don't infer intent from context — they follow instructions literally.
Design accordingly.
Sources & further reading
The experiments referenced in this post are from AGLedger's internal testbed (EXP-10 through EXP-14, EXP-22, EXP-27) and overnight validation runs (v0.10.10, 2026-03-29).
- Model Context Protocol specification
- llms.txt specification
- Anthropic — Tool use documentation
- OpenAI — Function calling guide
- Google — Gemini function calling
- AGLedger Protocol specification
- AGLedger API reference
- Zero-scaffolding API discovery: the full experiment
- Budget LLMs vs premium: the Doers vs Planners data