2026-04-13

Research

What We Learned Adapting to Google A2A v1.0

27 experiments. 30 multi-agent runs. 3 LLM providers. One finding that changed how we think about agent interfaces.

Summary

Google released the Agent-to-Agent protocol (A2A) v1.0 on March 12, 2026. We had been running multi-agent experiments against A2A since the preview announcement. When v1.0 shipped, we adapted AGLedger's A2A integration, ran 27 behavioral experiments across Claude Haiku, Gemini 2.5 Flash, and GPT-4o-mini, and found a structural gap: A2A handles task delegation, but agents that use it naturally declare intent without ever closing the accountability loop. The fix is not prompting — it is interface design.

The protocol landscape

The agent ecosystem has converged on a stack. Each protocol solves a different problem:

Discovery — Agent Cards / AGENTS.md — “what can you do?”

Coordination — A2A — “do this task”

Tools — MCP — “call this function”

Commerce — UCP — “buy this item”

Payment auth — AP2 — “you may spend $X”

Accountability — AGLedger — “the work is done — here is proof”

A2A and AGLedger overlap in task lifecycle management. They diverge on what happens after the work is done. A2A: the agent self-reports completion. AGLedger: an independent principal renders a verdict.

What changed in A2A v1.0

The v1.0 spec (March 2026) formalized what the preview hinted at. Key changes we had to adapt to:

Method names standardized — a2a.SendMessage, a2a.GetTask, a2a.CancelTask, a2a.ListTasks. We migrated from the preview equivalents.

Task state enums uppercased — TASK_STATE_SUBMITTED, TASK_STATE_WORKING instead of lowercase values, with ProtoJSON conventions throughout.

Agent Cards signed — JWS + JSON Canonicalization Scheme (RFC 8785). We reused our existing Ed25519 (RFC 8032) signing infrastructure.

Dual Part format — v1.0 uses a direct data field instead of the preview's kind discriminator. We accept both on input and emit v1.0 only.

Error model shift — the spec requires the google.rpc.Status format, so our JSON-RPC errors needed restructuring.

Security requirements — Agent Cards now declare securityRequirements (OAuth2, API key, mTLS, OpenID Connect).
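The method rename and the dual Part format can be handled together at the envelope layer. A minimal sketch, assuming preview Parts tag their payload with a kind discriminator while v1.0 Parts carry the payload fields directly; the params shape and Part field names here are illustrative, not copied from the spec:

```python
import json
import uuid

def normalize_part(part: dict) -> dict:
    """Accept both preview-style Parts (tagged with a 'kind' discriminator)
    and v1.0 Parts (payload fields carried directly); emit v1.0 only.
    Exact field names are assumptions for illustration."""
    if "kind" in part:
        part = {k: v for k, v in part.items() if k != "kind"}
    return part

def send_message_request(parts: list[dict]) -> str:
    """Build a JSON-RPC 2.0 envelope using the v1.0 method name."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": "a2a.SendMessage",  # preview-era method names retired
        "params": {"message": {"role": "user",
                               "parts": [normalize_part(p) for p in parts]}},
    })
```

With this shape, a preview-style Part such as {"kind": "text", "text": "migrate db"} goes out on the wire as {"text": "migrate db"}, matching the accept-both, output-v1.0-only policy described above.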

We shipped A2A support in AGLedger API v0.18.0–0.18.1 (adapting to the open-source A2A spec) with 59 unit tests plus 8 new integration tests. Six findings (F-348 through F-353) were caught and resolved during the migration.

The accountability gap

A2A defines an 8-state task lifecycle. AGLedger defines a 17-state mandate lifecycle. They look similar. They are not.

Capability | A2A v1.0 | AGLedger
Task delegation | Yes | Yes (mandates)
Structured acceptance criteria | No | Yes (contract types, JSON Schema)
Independent verdict | No — agent self-reports | Yes — principal renders PASS/FAIL
Tolerance checking | No | Yes (numeric bounds, auto-settle)
Tamper-evident audit trail | No | Yes (Ed25519 + SHA-256 hash chain)
Dispute resolution | No | Yes (3-tier, 6 grounds)
Delegation constraints | No | Yes (constraint inheritance through chains)

The gap is structural, not an implementation detail. A2A tells agents how to talk. It does not tell anyone whether the work was done to spec.
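The tamper-evident audit trail in the comparison above rests on a standard construction: each entry's hash covers the previous entry's hash, so editing any historical record breaks every later link. A generic sketch of the technique, not AGLedger's actual record format, and omitting the Ed25519 signatures over each entry:

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first link

def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous link plus a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()

def verify_chain(records: list[dict], hashes: list[str]) -> bool:
    """Recompute the chain from genesis and compare to stored hashes."""
    prev = GENESIS
    for record, stored in zip(records, hashes):
        prev = entry_hash(prev, record)
        if prev != stored:
            return False
    return True
```

Because each stored hash depends on everything before it, a verifier only needs the record stream and the hash column to detect tampering anywhere in the history.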

27 experiments

We gave three LLM agents (Claude Haiku, Gemini 2.5 Flash, GPT-4o-mini) a collaborative cloud infrastructure migration task. AGLedger's contract spec was their coordination tool. We varied the interface, the tool count, the tool descriptions, and the available contract types. 30 runs. 27 documented findings (EXP-01 through EXP-27).

Three findings changed how we build:

Agents declare intent but never close the loop

In our baseline run, agents created 8 mandates (declarations of what they intended to do) and submitted zero receipts (evidence of delivery). The lifecycle stopped at ACTIVE. The declaration side works. The closure side does not happen naturally.

This is not a bug in the agents. It is a design gap. A2A's lifecycle ends when the agent says “done.” There is no structural incentive to submit evidence.
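Closing the loop means a submitted receipt gets checked against declared acceptance criteria rather than taken on the agent's word. A minimal version of the numeric tolerance check from the comparison table, with hypothetical receipt and criteria shapes:

```python
def within_tolerance(expected: float, actual: float, tolerance: float) -> bool:
    """True when the measured value sits inside the declared bound."""
    return abs(actual - expected) <= tolerance

def render_verdict(receipt: dict, criteria: dict) -> str:
    """PASS/FAIL rendered by the principal, not self-reported by the worker.
    Assumed shapes: receipt = {"measured": float},
    criteria = {"expected": float, "tolerance": float}."""
    ok = within_tolerance(criteria["expected"], receipt["measured"],
                          criteria["tolerance"])
    return "PASS" if ok else "FAIL"
```

The point of the structure is who runs it: the check executes on the principal's side, which is what makes auto-settle possible once evidence arrives.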

Interface design beats prompt engineering — 32x

We tested two approaches to increase receipt submission:

Prompt fix — tool description says “submit_receipt — this is REQUIRED”

Result: receipt ratio went from 20% to 640%. The word REQUIRED is load-bearing.

Interface fix — remove the task dispatcher entirely, make accountability tools the only path

Result: 6 proposals, 56 receipts, 41 auto-settled, 0 dispatcher calls. 933% receipt ratio.

Prompt engineering produced a 32x improvement. Removing the alternative path produced a 47x improvement. When accountability is the only interface, agents use it.

Interface shapes behavior. This is the single most important finding from our A2A work.
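Mechanically, the interface fix is just registry filtering: dispatcher tools are never offered, so submitting evidence becomes the only way to finish. A sketch in which only complete_task and submit_receipt come from the text above; the other tool names and path labels are illustrative:

```python
# Hypothetical registry: maps each tool to the path it belongs to.
TOOL_REGISTRY = {
    "complete_task":   "dispatcher",      # lets an agent just say "done"
    "dispatch_task":   "dispatcher",
    "propose_mandate": "accountability",
    "submit_receipt":  "accountability",  # structured evidence of delivery
    "settle_mandate":  "accountability",
}

def accountability_only(registry: dict[str, str]) -> list[str]:
    """The interface fix: expose no dispatcher path at all."""
    return sorted(name for name, path in registry.items()
                  if path == "accountability")
```

With no complete_task in the offered tool list, the 933% receipt-ratio behavior follows from the interface itself, with no prompt exhortation needed.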

Tool count kills compliance

With 36 tools available (31 AGLedger + 5 dispatcher), receipt submission dropped to 0%. This matches published research on tool selection: accuracy drops from 43% to 14% as tool counts grow (RAG-MCP, arXiv 2505.03275). Agents treated the simpler complete_task dispatcher call as sufficient. Why submit structured evidence when you can just say “done”?

With a focused 10-tool A2A profile: 6 proposals, 6 accepts, 8 receipts, 7/7 settlements. All three providers engaged.

Model behavior divergence

Same interface, different behavior:

Budget models (Haiku, Flash) — fewer mandates, 609% receipt ratio. Over-receipt. They do the work.

Premium models (Sonnet, Opus) — many proposals, 18% receipt ratio. Over-propose. They plan the work.

GPT-4o-mini — cannot reliably construct evidence objects from schema descriptions. 38/39 failures in Run 23.

No one-size-fits-all interface. AGLedger's MCP server ships multiple tool profiles because the right interface depends on the model.

What we shipped

Based on these findings, AGLedger v0.18.0–0.18.1 shipped:

A2A v1.0 Agent Card at /.well-known/agent-card.json with skills mapped to accountability operations

JSON-RPC 2.0 binding — a2a.SendMessage, a2a.GetTask, a2a.CancelTask, a2a.ListTasks

Two new contract types — ACH-ANALYZE-v1 (cognitive work) and ACH-COORD-v1 (coordination), added because agents told us the existing types did not cover their tasks

Focused 10-tool A2A profile for MCP — propose, accept, list, submit, settle, transition, get, search, reputation, help

Auto-verify — eliminated verification latency. 43 receipts, 0 errors, 0 dispatcher calls in testing

urn:agledger:* extensions for accountability semantics (mandate metadata, verdict, audit proof, receipt schema, dispute, reputation)
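Integrating against the shipped endpoint starts with fetching /.well-known/agent-card.json and sanity-checking it. A minimal sketch: the skills field comes from the list above, but the name field and the validation rules are assumptions, and the v1.0 spec (plus JWS verification for signed cards) remains the authority:

```python
import json
from urllib.request import urlopen

AGENT_CARD_PATH = "/.well-known/agent-card.json"

def validate_agent_card(card: dict) -> dict:
    """Minimal structural check; required-field names are illustrative."""
    for field in ("name", "skills"):
        if field not in card:
            raise ValueError(f"agent card missing {field!r}")
    if not isinstance(card["skills"], list):
        raise ValueError("skills must be a list")
    return card

def fetch_agent_card(base_url: str) -> dict:
    """Fetch the well-known card and run the structural check."""
    with urlopen(base_url.rstrip("/") + AGENT_CARD_PATH) as resp:
        return validate_agent_card(json.loads(resp.read()))
```

Keeping validation separate from fetching makes the structural check testable offline and reusable against cards received through other channels.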

Where AGLedger sits

A2A is the coordination protocol. AGLedger is the accountability layer. They are complementary, not competing.

A2A tells agents how to delegate. AGLedger answers the question no other protocol in the stack addresses: was the work actually completed to spec, and can you prove it?

A2A tells agents how to talk. AGLedger tells everyone whether the work was done.

Sources & further reading

A2A v1.0 Specification — Agent-to-Agent Protocol, Linux Foundation

A2A v1.0 Announcement — March 2026 release notes

Google A2A Announcement — Original protocol announcement (April 2025)

A2A GitHub — Specification source, SDKs, and samples

Model Context Protocol — Anthropic's open protocol for LLM tool integration

AP2 Specification — Agent Payment Protocol (Google + Coinbase)

Universal Commerce Protocol — Commerce standard for AI agents (Google + Shopify)

RFC 8032 — Ed25519 Digital Signatures

RFC 8785 — JSON Canonicalization Scheme

RFC 9421 — HTTP Message Signatures

arXiv 2505.03275 — RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection

Stripe Webhook Signatures — Industry-standard HMAC-SHA256 webhook verification pattern
