2026-04-06 · Research

Budget LLMs Outperform Premium Models at Evidence Submission

By Michael Cooper · Founder

Note: In API v0.23.0 (May 2026) the performer's evidence submission was renamed from “receipt” to “Completion” to align with the IETF SCITT vocabulary, where “receipt” now refers to a cryptographic inclusion proof. This post is preserved with original terminology; substitute “Completion” mentally where the post discusses performer evidence.

TL;DR

We ran 27 multi-agent experiments pitting budget models (Claude Haiku, GPT-4o-mini, Gemini Flash) against premium models (Claude Sonnet, GPT-4o, Gemini Pro) on the same collaborative tasks using AGLedger's record/receipt protocol as the coordination layer.

Budget models delivered a 609% receipt-to-record ratio. Premium models delivered 18%.

The cheap models did the work. The expensive models made plans about doing the work.

This matters because the governance gap is real. According to Deloitte's 2026 State of AI report, only 21% of organizations deploying agentic AI have a mature governance model, even as 74% plan to use it within two years.

The experiment

Three agents — one from each major LLM provider — given a collaborative task with AGLedger's contract spec as their only coordination tool. No scripted outcomes. Agents chose how and whether to use records. We observed what happened.

Each experiment ran against both budget and premium model tiers with identical tool profiles, identical prompts, identical tasks. The only variable was the model behind each agent.

The metric that matters: receipt ratio, the number of receipts submitted per record created. A ratio above 100% means the agent submitted more receipts than it created records, so at least some records carry multiple pieces of evidence. A ratio below 100% means at least some records were created but never fulfilled. This is the closest proxy we have for “did the agent actually do what it said it would do?”

The numbers

From EXP-16 in our A2A experiment series:

| Metric | Budget | Premium |
| --- | --- | --- |
| Receipt ratio | 609% | 18% |
| Behavior pattern | Fewer records, many receipts | Many proposals, few receipts |
| Rounds to complete | 1–7 | 2 |
| API calls | 48–74 | 35–38 |

What this metric is, and what it is not. Receipt ratio measures evidence submissions per record. It is not a count of distinct tasks completed. Budget models over-document — submitting multiple receipts per record — which inflates the ratio relative to raw task completion. We are not claiming budget models complete 33x more tasks. We are claiming they submit 33x more evidence per record under identical conditions, while premium models stop short and write proposals.
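To make the distinction concrete, here is a minimal Python sketch. The per-record receipt counts are invented for illustration (they are not data from EXP-16), and `receipt_ratio` and `fulfilled_share` are hypothetical helper names, not AGLedger API calls.

```python
def receipt_ratio(receipts_per_record):
    """Receipts submitted divided by records created."""
    if not receipts_per_record:
        raise ValueError("no records")
    return sum(receipts_per_record) / len(receipts_per_record)


def fulfilled_share(receipts_per_record):
    """Share of records that received at least one receipt."""
    if not receipts_per_record:
        raise ValueError("no records")
    fulfilled = sum(1 for n in receipts_per_record if n > 0)
    return fulfilled / len(receipts_per_record)


# Illustrative counts only: one entry per record, value = receipts on that record.
over_documenter = [6, 7, 5, 6]            # few records, many receipts each
planner = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # many records, few receipts

print(receipt_ratio(over_documenter))    # 6.0
print(fulfilled_share(over_documenter))  # 1.0
print(receipt_ratio(planner))            # 0.2
print(fulfilled_share(planner))          # 0.2

# Fold difference between the two reported ratios (609% vs 18%).
print(round(6.09 / 0.18, 1))  # 33.8
```

In the invented over-documenting case the ratio is 6.0 while the fulfilled share is 1.0: the ratio can climb past 100%, but the share of records with any evidence cannot. The reported gap, 609% divided by 18%, works out to roughly 33.8, which this post rounds down to 33x.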

The asymmetry is the finding. Premium models optimize for elegance per round (fewer calls, more reasoning per call, richer reflections). Budget models optimize for closing the loop (do the thing, hand back evidence, move on). For accountability infrastructure where evidence is the product, the budget-model behavior is the one that ships.

Doers vs Planners

Budget models are doers. They take a record, execute it, submit evidence, move on. They don't deliberate much. They don't verify cross-agent state. They don't write reflections on the nature of collaboration. They do the thing and report back.

Premium models are planners. They create many proposals. They verify state before acting. They discover cross-agent permission boundaries. They produce “richer reflections.” They are more thorough per round, finishing in 2 rounds vs the budget models' 1–7, using 35–38 API calls vs 48–74.

The premium models are more efficient. Fewer calls, fewer rounds, better per-step reasoning. But efficiency in planning is not the same as completion. The budget models are less elegant but they close the loop.

If you're paying for Sonnet to do the work of Haiku, you're paying for a better plan that never ships.

The GPT-4o-mini evidence problem

Before you conclude that budget models are universally better at execution: GPT-4o-mini has a specific, measurable capability gap.

Across Runs 19–23, GPT-4o-mini consistently failed to construct the evidence field when calling submit_receipt. 38 out of 39 attempts failed in Run 23 alone. It can do the work. It cannot describe what it did in a structured object matching a JSON schema.

This isn't a prompt issue. It's a model capability gap: in these runs GPT-4o-mini could not reliably construct nested structured objects from schema descriptions. The implication: submit_receipt needs a simplified evidence mode for models with this gap, or you accept a 97% failure rate (38 of 39 attempts in Run 23) on structured evidence from GPT-4o-mini.
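One way a simplified evidence mode could work is to let the model send a flat map of dotted keys and have the server expand it into the nested object. The sketch below is an assumption about a possible design, not AGLedger's API: `nest_flat_evidence` and every field name in the example are hypothetical, and the real evidence schema is not shown in this post.

```python
def nest_flat_evidence(flat):
    """Expand dotted keys such as "artifact.url" into a nested dict."""
    nested = {}
    for dotted_key, value in flat.items():
        parts = dotted_key.split(".")
        node = nested
        # Walk (and create) the intermediate objects for every prefix part.
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"{dotted_key!r} conflicts with scalar {part!r}")
            node = child
        leaf = parts[-1]
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"{dotted_key!r} conflicts with nested {leaf!r}")
        node[leaf] = value
    return nested


# Hypothetical flat payload a model might find easier to produce.
flat_evidence = {
    "summary": "Migrated 3 tables",
    "artifact.url": "https://example.com/report",
    "artifact.checksum.sha256": "abc123",
}

print(nest_flat_evidence(flat_evidence))
```

The design choice here is to move the nesting out of the model and into deterministic code: the model only has to emit string keys and scalar values, and conflicting keys fail loudly instead of silently overwriting evidence.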

What this means for model selection

The conventional wisdom is: use premium models for important work and budget models for simple tasks. Our data suggests the opposite framing is more useful:

  • Execution work (do X, report evidence): budget models. They submit 33x more evidence per record than premium models do, and they are cheaper. The only caveat is structured evidence construction, which is an API design problem rather than a model selection problem.
  • Planning and coordination (decompose a goal, assign sub-tasks, verify dependencies): premium models. Their thoroughness, state verification, and cross-agent awareness are genuine advantages for orchestration.
  • Don't mix them naively. GPT-4o never meaningfully engaged across 6 premium runs (EXP-18) under the tested tool configurations. Model tier is not a linear quality scale; each model has a behavioral profile that either fits the task shape or doesn't.

The real answer is: stop guessing. Measure it.

Measuring it from the chain

The chain produces signals you can measure directly, without standing up a separate eval harness. Each record has a written intent and (where applicable) a receipt and verdict, all signed and timestamped. Over runs, the behavior profile of each model lands in a few quantitative shapes:

  • Receipt ratio — does this model finish what it starts?
  • Evidence quality — can it construct structured proof of delivery against the receipt schema?
  • Timeliness — does it deliver within the record's deadline?
  • Verdict distribution — PASS vs FAIL across records for this performer (on the Verify path).

Run the same task through Haiku and Sonnet. The chain records both. After 100 records you don't have opinions about which model is better. You have measurements you can reason about.

609% vs 18% is not a benchmark; it is a measurable difference in how each model engages with structured work. That is the difference between “scored well on MMLU” and “does what you tell it to do, on this specific kind of task, in your specific environment.”

A formal scoring product is on the roadmap. The signals above are derivable today from the chain itself; we do not ship a separate reputation service.

Key takeaways

  1. Budget models (Haiku, GPT-4o-mini, Gemini Flash) are execution machines — 609% receipt ratio, meaning they over-deliver evidence relative to records.
  2. Premium models (Sonnet, GPT-4o, Gemini Pro) are planning machines: an 18% receipt ratio means at least 82% of their records never received a receipt.
  3. Premium models are more efficient per round (2 rounds, 35–38 calls vs 1–7 rounds, 48–74 calls) but efficiency in planning is not completion.
  4. GPT-4o-mini has a specific structured-evidence gap (38/39 failures) that is solvable through API design.
  5. GPT-4o has a persistent disengagement pattern that is not solvable through API design.
  6. Model selection should be based on behavior measured from your own records, not tier assumptions.

For a business perspective on what “agentic” actually means — and how it differs from traditional AI — see What “Agentic” Really Means for Your Business.
