Budget LLMs Outperform Premium Models at Task Completion
TL;DR
We ran 27 multi-agent experiments pitting budget models (Claude Haiku, GPT-4o-mini, Gemini Flash) against premium models (Claude Sonnet, GPT-4o, Gemini Pro) on the same collaborative tasks using AGLedger's mandate/receipt protocol as the coordination layer.
Budget models delivered a 609% receipt-to-mandate ratio. Premium models delivered 18%.
The cheap models did the work. The expensive models made plans about doing the work.
The experiment
Three agents — one from each major LLM provider — given a collaborative task with AGLedger's contract spec as their only coordination tool. No scripted outcomes. Agents chose how and whether to use mandates. We observed what happened.
Each experiment ran against both budget and premium model tiers with identical tool profiles, identical prompts, identical tasks. The only variable was the model behind each agent.
The metric that matters: receipt ratio, i.e. receipts submitted per mandate created. A ratio above 100% means the agent submitted multiple pieces of evidence per commitment. Below 100% means mandates were created but never fulfilled. This is the closest proxy we have for “did the agent actually do what it said it would do?”
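As a formula, the metric is a one-liner. A minimal sketch (the function name and the raw mandate/receipt counts below are illustrative, chosen to be consistent with the reported ratios, not taken from the testbed):

```python
def receipt_ratio(mandates_created: int, receipts_submitted: int) -> float:
    """Receipts submitted per mandate created, as a percentage.

    Above 100%: multiple pieces of evidence per commitment.
    Below 100%: mandates created but never fulfilled.
    """
    if mandates_created == 0:
        return 0.0
    return 100.0 * receipts_submitted / mandates_created

# Hypothetical raw counts consistent with the reported figures:
print(receipt_ratio(11, 67))  # budget-tier agents: ~609%
print(receipt_ratio(50, 9))   # premium-tier agents: 18.0%
```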
The numbers
From EXP-16 in our A2A experiment series:
| Metric | Budget | Premium |
|---|---|---|
| Receipt ratio | 609% | 18% |
| Behavior pattern | Fewer mandates, many receipts | Many proposals, few receipts |
| Rounds to complete | 1–7 | 2 |
| API calls | 48–74 | 35–38 |
Read that again. Budget models completed work at a rate 33x higher than premium models. Not because premium models are worse — because they optimized for a different objective.
Receipt ratio measures evidence submissions per mandate, not distinct tasks completed. Budget models over-document — submitting multiple receipts per mandate — which inflates this ratio relative to raw task completion.
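One way to separate the two effects is a fulfillment rate that counts each mandate at most once, which is immune to over-documentation. A hypothetical helper (the mandate-ID-to-receipt-count mapping is an assumed shape, not AGLedger's actual data model):

```python
def fulfillment_rate(receipts_by_mandate: dict[str, int]) -> float:
    """Fraction of mandates with at least one receipt, as a percentage.

    Unlike receipt ratio, a mandate with five receipts counts the same
    as a mandate with one, so over-documentation cannot inflate it.
    """
    if not receipts_by_mandate:
        return 0.0
    fulfilled = sum(1 for n in receipts_by_mandate.values() if n > 0)
    return 100.0 * fulfilled / len(receipts_by_mandate)

# Same agent: one heavily documented mandate, one normal, one unfulfilled.
# Receipt ratio would report 200%; fulfillment rate reports ~66.7%.
print(fulfillment_rate({"m1": 5, "m2": 1, "m3": 0}))
```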
Doers vs Planners
Budget models are doers. They take a mandate, execute it, submit evidence, move on. They don't deliberate much. They don't verify cross-agent state. They don't write reflections on the nature of collaboration. They do the thing and report back.
Premium models are planners. They create many proposals. They verify state before acting. They discover cross-agent permission boundaries. They produce “richer reflections.” They are more thorough per round — finishing in 2 rounds vs the budget models' 1–7, using 35–38 API calls vs 48–74.
The premium models are more efficient. Fewer calls, fewer rounds, better per-step reasoning. But efficiency in planning is not the same as completion. The budget models are less elegant but they close the loop.
If you're paying for Sonnet to do the work of Haiku, you're paying for a better plan that never ships.
The GPT-4o-mini evidence problem
Before you conclude that budget models are universally better at execution: GPT-4o-mini has a specific, measurable capability gap.
Across Runs 19–23, GPT-4o-mini consistently failed to construct the evidence field when calling submit_receipt. 38 out of 39 attempts failed in Run 23 alone. It can do the work. It cannot describe what it did in a structured object matching a JSON schema.
This isn't a prompt issue. It's a model capability gap — budget-tier models struggle to construct nested structured objects from schema descriptions. The implication: submit_receipt needs a simplified evidence mode for budget-tier models, or you accept a 97% failure rate on structured evidence from GPT-4o-mini.
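A simplified evidence mode could be as small as a coercion layer in front of the receipt endpoint: accept whatever the model produced and normalize it before schema validation. A sketch under that assumption (the `coerce_evidence` helper and the `summary` field are hypothetical, not AGLedger's actual API):

```python
import json
from typing import Any

def coerce_evidence(raw: Any) -> dict:
    """Normalize model output into a minimal evidence object.

    Structured dicts pass through untouched (the premium-model path).
    Strings are parsed as JSON if possible; otherwise the free text is
    wrapped in a one-field object so submission never fails on schema
    grounds (the budget-model fallback path).
    """
    if isinstance(raw, dict):
        return raw  # already structured: pass through
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)  # model emitted JSON as a string
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass
        return {"summary": raw}  # wrap free text in a minimal schema
    return {"summary": str(raw)}  # last-resort stringification

print(coerce_evidence('{"files": ["a.py"]}'))  # → {'files': ['a.py']}
print(coerce_evidence("wrote a.py and ran the tests"))
```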
What this means for model selection
The conventional wisdom is: use premium models for important work and budget models for simple tasks. Our data suggests the opposite framing is more useful:
- Execution work (do X, report evidence) — budget models. They complete at 33x the rate. They're cheaper. The only caveat is structured evidence construction, which is an API design problem, not a model selection problem.
- Planning and coordination (decompose a goal, assign sub-tasks, verify dependencies) — premium models. Their thoroughness, state verification, and cross-agent awareness are genuine advantages for orchestration.
- Don't mix them naively. GPT-4o never meaningfully engaged across 6 premium runs (EXP-18) under the tested tool configurations. Model tier is not a linear quality scale — each model has a behavioral profile that either fits the task shape or doesn't.
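The selection rules above reduce to a tier map keyed on task shape. A sketch (the `Task` type, tier names, and routing function are illustrative, not a shipped AGLedger feature):

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str  # "execution" or "planning"

# Tier map reflecting the selection rules above.
TIER_FOR_KIND = {
    "execution": "budget",   # do X, report evidence: budget tier completes
    "planning": "premium",   # decompose, verify, coordinate: premium tier
}

def pick_tier(task: Task) -> str:
    # Default to premium for task shapes we haven't measured yet.
    return TIER_FOR_KIND.get(task.kind, "premium")
```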
The real answer is: stop guessing. Measure it.
Measuring it with reputation scoring
This is exactly what AGLedger's reputation system is built for. Every mandate creates a commitment. Every receipt records delivery. Every verdict records acceptance. Over time, you get a quantitative profile of each model's actual behavior:
- Receipt ratio — does this model finish what it starts?
- Evidence quality — can it construct structured proof of delivery?
- Timeliness — does it deliver within the mandate's deadline?
- Verdict distribution — PASS vs FAIL across all mandates for this performer.
Run the same task through Haiku and Sonnet. AGLedger records both. After 100 mandates you don't have opinions about which model is better — you have data.
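A minimal in-memory sketch of such a profile, accumulating the four signals listed above (the record-keeping functions and field names are assumptions, not AGLedger's actual data model):

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Profile:
    mandates: int = 0        # commitments created
    receipts: int = 0        # deliveries recorded
    valid_evidence: int = 0  # receipts with well-formed structured evidence
    on_time: int = 0         # receipts within the mandate's deadline
    verdicts: dict = field(default_factory=lambda: defaultdict(int))

    @property
    def receipt_ratio(self) -> float:
        """Does this model finish what it starts?"""
        return 100.0 * self.receipts / self.mandates if self.mandates else 0.0

profiles: dict[str, Profile] = defaultdict(Profile)

def record_mandate(model: str) -> None:
    profiles[model].mandates += 1

def record_receipt(model: str, evidence_ok: bool, on_time: bool) -> None:
    p = profiles[model]
    p.receipts += 1
    p.valid_evidence += int(evidence_ok)
    p.on_time += int(on_time)

def record_verdict(model: str, verdict: str) -> None:
    profiles[model].verdicts[verdict] += 1  # e.g. "PASS" or "FAIL"
```

After enough mandates flow through, comparing two models is a dictionary lookup rather than a debate.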
609% vs 18% is not a benchmark. It's a reputation score derived from real agent behavior on real tasks. That's the difference between “this model scored well on MMLU” and “this model actually does what you tell it to do.”
Key takeaways
- Budget models (Haiku, GPT-4o-mini, Gemini Flash) are execution machines — 609% receipt ratio, meaning they over-deliver evidence relative to commitments.
- Premium models (Sonnet, GPT-4o, Gemini Pro) are planning machines — 18% receipt ratio, meaning 82% of their proposals never result in delivered work.
- Premium models are more efficient per round (2 rounds, 35–38 calls vs 1–7 rounds, 48–74 calls) but efficiency in planning is not completion.
- GPT-4o-mini has a specific structured-evidence gap (38/39 failures) that is solvable through API design.
- GPT-4o has a persistent disengagement pattern that is not solvable through API design.
- Model selection should be based on measured reputation data, not tier assumptions.
Sources & further reading
- EXP-15, EXP-16, EXP-17 — AGLedger A2A Experiment Findings (internal testbed data)
- Anthropic — Claude models overview
- OpenAI — Models documentation
- Orca: Progressive Learning from Complex Explanation Traces (Microsoft Research, 2023)
- Textbooks Are All You Need — Phi-1 (Microsoft Research, 2023)