Durable Intent, Measured: We Wiped Four Agents and Asked Them to Finish Their Own Work

By Michael Cooper · Founder

2026-06-10Research

Tested 2026-06-10 against AGLedger 1.0.0 (GA) on EKS 1.36 / Aurora PostgreSQL 17.9. Models: Anthropic claude-haiku-4-5, OpenAI gpt-4o-mini, Google gemini-2.5-flash, Amazon nova-pro - identical prompts and tools across all four.

Updated 2026-07-21: expanded to state the durable-intent concept in full before the experiment.

The industry has decided that agent work must survive the agent. Temporal raised $300M this February on the promise of durable agentic AI. Azure now ships a product surface named Durable Task for AI Agents. Google's ADK ships sessions that pause and resume. And on June 10, Mastercard launched agent-to-agent payments - machines paying machines, at machine speed.

All of that engineering answers one question: does the statesurvive? Checkpoints persist, workflows replay, sessions resume. There is a second question almost nobody measures: when the state survives but the agent's context does not, can the agent actually act on what it finds?

We measured it. The claim under test: durable intent - intent and progress notarized into a signed ledger outside the agent's context window - is what lets work outlive the agent doing it. This post is the experiment. The honest result: the durable substrate holds, the idempotency guarantees hold, but whether a cold agent can act on the record is model-dependent - unless the work sits behind a deterministic gate, which made recovery mechanical enough that the weakest model we tested went from silently abandoning everything to completing everything.

The concept under test

The failure mode that motivates all of this is familiar to any team running agents. An agent starts a piece of work and loses the thread - a context window truncates, a session expires, the work hands off to a fresh agent. The dangerous part is not that it forgot. Asked to remember, a language model does not say “I don't know”; it reconstructs a fluent, plausible account that nothing flags as invented. The two usual tools do not close the gap. Agent memory and RAG hand back summaries and similarity hits - reconstructed and unsigned, with no way to tell faithful recall from confident fabrication. Logs capture what happened after the fact, but a log is written about the action, not the intention behind it.

Durable intent is the thing in between: at the start of a task, the agent states what it intends to do and under whose authority, and that statement is signed and hash-chained the moment it is written. The agent holds no keys and assembles no proofs - it reports with its API key, and the notary signs what it reported, naming the accountable principal inside the signed record. When the context is gone, the agent reads the record back byte-for-byte instead of remembering. The same record serves two more readers later: the operator, for oversight of the fleet, and the auditor, who verifies the cryptography offline - so the developer who adopts it for recovery is accumulating audit-grade evidence without doing anything extra. (The rest of the site calls this primitive signed intent; “durable” increasingly reads as durable execution, which this is not.)

The experiment: no memory, no map

Each agent was restarted cold: zero memory, no conversation history, no record of what it had been doing. It received its API key, the server's base URL, one generic HTTP tool, and a generic goal: you restarted and may have left work unfinished; find what you own and finish it; start at /llms.txt.

We deliberately did not pre-solve the path. No endpoint names, no lifecycle steps, no purpose-built recovery tool. The agent had to discover the records API, work out that its key scopes queries to its own records, and infer the lifecycle from the server's own surfaces. That makes this a measure of whether the product enables recovery, not whether we hand-wired a happy path. Each agent owned three half-finished items on a live AGLedger 1.0.0 instance; every call hit the real API and every record carries a real signature.

The closest published work we know of is Letta's Recovery-Bench, where agents inherit a polluted context - a failed trajectory plus a corrupted environment - and must finish the task. Its finding that recovery is a distinct capability that reorders model rankings matches what we saw. The difference in design: Recovery-Bench keeps the dirty context. We deleted the context entirely and gave the agent nothing but a key and a ledger. As far as we can find, nobody has published that experiment before.

What held: the substrate and the no-double-charge guarantee

The record survives the agent. Every intent and progress entry written before the wipe was still there after it - signed as COSE_Sign1 envelopes with Ed25519, readable back byte-for-byte, not reconstructed from anyone's memory. This part of the durable-intent claim is simply true, and it is the part the durable-execution industry has also solved for its own state.

Self-discovery works for capable models - with the agent's key alone. An agent key scopes GET /v1/records to records where the caller is principal or performer. Claude and Gemini both found their own open work cold, with no filter beyond the key they authenticated with. Nova did not - more on that below.

No double-action. Replaying a create with the same Idempotency-Key and body returned the same record id with x-idempotency-replayed: true; the same key with a different body was rejected with a 400. That is the server-side guarantee that a recovering agent re-running its last step does not charge twice. The timing here is not subtle: the agentic-payment standards of the past year, from Google's AP2 to Stripe's machine-payment protocols, have converged on signed authorization plus dedup-by-nonce - and Mastercard switched on agent-to-agent payments the same week we ran this. We replayed the calls and watched the dedup hold.

The wall: every model stalled at the judgment gate

Then the experiment hit the Gate. When the half-finished work was gated work awaiting a principal's decision - multi-party by design - cold recovery to a finished state went 0 for 3 on every model. (“Completed” throughout means the agent drove the record to its terminal FULFILLED state on its own.)

Model	Judgment gate (multi-party)	Deterministic gate (auto mode)
Anthropic claude-haiku-4-5	0 / 3 - accepted all three, then correctly waited on the principal	3 / 3
Amazon nova-pro	0 / 3 - silently abandoned all three owned records	3 / 3
Google gemini-2.5-flash	0 / 3 - best discovery, blocked at the activation wall	2 / 3
OpenAI gpt-4o-mini	0 / 3 - accepted two of three, blocked at the wall	1 / 3

Two different things are happening in that 0/3 column.

The first is correct behavior. Gated work is multi-party: a performer can recover, rediscover its records, and accept them, but activation and the verdict belong to the principal. Claude reasoned its way to exactly that conclusion and stopped, which is what the gate is for - it enforces who decides. A recovering performer finishing multi-party work solo would be a bug, not a feature.

The second is a real product gap, and we are not going to pretend otherwise. The recovery signals on a just-accepted record pointed the performer at an action it was forbidden to take, and there is no single “what is actionable by me right now” query - a weak agent has to already know which record statuses mean my turn. Both gaps are filed with the API team. The model that fell into the second, Nova, searched the wrong status, concluded “I have no unfinished work… nothing more for me to do,” and walked away from three pieces of owned work without flagging anything to anyone. Silent abandonment is the worst failure mode in this entire experiment: nothing crashed, nothing errored, and the work simply died - except that the ledger still held all three records, signed and open, which is the only reason we can even report the failure precisely.

The flip: deterministic gates made the weakest model recover

Then we changed one variable. Same cold agent, same prompt, same three half-finished items - but the work sat behind a deterministic gate: a contract type whose completion criteria are declared up front as machine-checkable tolerance bands, so the rules engine renders the verdict mechanically and no second party is in the loop.

Nova - the model that had silently abandoned all of its work - went from 0/3 to 3/3. The recovery path became mechanical: find the ACTIVE record, read its completion hint, submit the completion evidence, and the engine renders the verdict against the criteria the principal declared when the record was created. At no point does the agent have to reason about whose turn it is or wait on anyone.

The API's error envelope did some of the teaching itself. gpt-4o-mini submitted a malformed completion, got a 422 naming the missing required property, read the error, and self-corrected on the next call. A recovering agent does not need a smart model to follow that path; it needs a server that says exactly what is wrong.

This flip is the finding. The substrate was identical in both conditions - same signed records, same discovery, same key scoping. What changed was the shape of the gate: recovery stopped depending on the model's comprehension. Deterministic gates make each piece of recovered work finishable by whatever model picks it up, commodity tier included.

Why “durable” infrastructure does not settle this

The durable-execution platforms are right about their half of the problem: state must survive crashes, and replay must be deterministic. But as Diagrid put it bluntly in February, checkpointing says “I saved your state. You take it from here.”Even full durable execution replays infrastructure state - it cannot replay judgment. When the resumed step is “a language model decides what to do next,” the platform hands a restored context to a model and hopes.

Our data is a measurement of that hope, taken at its hardest setting: no restored context at all. The substrate turned out to be necessary but not sufficient. With a perfect signed record of its own obligations one HTTP call away, one production model in four concluded it had nothing to do. What closed the gap was pre-declared, machine-checkable completion criteria, not more stored state.

This rhymes with where agent-design discourse has landed - own your control flow, put the deterministic rails outside the model's discretion - and it extends the durable-execution argument one layer up the stack: durability has to reach the work contract, not just the process state. Tomorrow's commodity models are today's frontier models, but the fleet you run this year is built on the cheap tier. Design recovery so the cheap tier survives it.

What we are not claiming

Not every agent can recover any work. Multi-party judgment work needs the principal - by design - and even behind deterministic gates the residual failures shifted from comprehension to persistence: gpt-4o-mini and Gemini completed what they touched but stopped before draining their whole queue. The deterministic gate makes each item finishable; a “what is actionable by me” affordance, which we have filed for, is what would make a weak agent reliably finish all of them.

And the rules engine does not inspect deliverable content or know whether work is correct. It checks the structured completion evidence against the criteria the principal declared - counts, quantities, tolerance bands - and renders the verdict it was configured to render. The chain proves what was claimed, signed, and decided. Judgment about the value of the work stays where it belongs, with the principal.

If your agents' work has to survive them

Three design rules fall out of this experiment. Put intent and progress outside the agent's context window, in a store it can rediscover with its key alone - the recovery entry point has to be the credential, because after a wipe the credential is all the agent has. Make recovery mechanical for any work a commodity model might pick up: declare completion criteria up front and let the verdict render deterministically. And put idempotency on the server, because a recovering agent will replay its last action, and the difference between a replay and a double-charge should never depend on the model's memory.

AGLedger is the first and third of those out of the box, and the second is one field in a contract type. The agent reports with its key; the notary signs what it reported; the record outlives the context that created it. The free Developer Edition runs on your own infrastructure, and the experiment above is the kind of thing you can rerun against it.

Sources & further reading

Recovery-Bench (Letta) - the closest prior art: agents recovering from a polluted context. Recovery is a distinct capability that reorders model rankings; our experiment removes the context entirely.
Checkpoints Are Not Durable Execution (Diagrid, Feb 2026) - why saved state still leaves resume, failure detection, and dedup on the developer.
12-Factor Agents - own your control flow; deterministic code with strategically placed model decisions.
Temporal raises $300M Series D (Feb 2026) - the scale of the durable-agents bet.
Long-running agents that pause and resume with ADK (Google, May 2026) - recovery via pre-declared state machines: the same mechanical-recovery instinct, framework-side.
Durable Task for AI Agents (Microsoft Learn) - the durable-execution industry's answer to the state half of the problem.
Mastercard launches Agent Pay for Machines (Fortune, 2026-06-10) - agent-to-agent payments are live; duplicate-action protection is the unglamorous prerequisite.
RFC 9052 - COSE - the COSE_Sign1 envelope each record is signed as.
in-toto Attestation Framework - Statement v1 - the payload format AGLedger signs intent and result into.
The Gate - principal mode, auto mode, and the bright lines between them

reconstructThe reconstruction-time measurement

gateHow deterministic verdicts work

keyGet a free key