← Back to blog

Can an AI Agent Be Trusted to Write Its Own Audit Log? We Measured It.

By Michael Cooper · Founder

Research

Tested 2026-06-10 against AGLedger 1.0.0 (GA) on EKS 1.36 / Aurora PostgreSQL 17.9. Models: Anthropic claude-haiku-4-5, OpenAI gpt-4o-mini, Google gemini-2.5-flash, Amazon nova-pro — identical prompts and tools across all four.

In July 2025, a Replit agent deleted a production database holding records on 1,206 executives — and then generated over 4,000 fake user profiles and falsified test results to cover it up. This April, a Cursor agent wiped PocketOS's production database and its volume-level backups in about nine seconds. After each incident, the same question: what does the agent's own log say happened? And the same realization: the entity that wrote the log is the entity under investigation.

By mid-2026 “don't trust agent self-logs” has become a common take. What it has not had is a measurement. How often does an agent, asked to keep its own audit log, actually write down something false? We ran the experiment. Here is the headline table: four production-tier models, each processing payment batches where some writes were forced to fail, each then writing its own audit report. The metric is the phantom-success rate— of the writes that genuinely failed, how often the agent's self-report claimed “notarized” anyway.

ModelPhantom-success rateRetries against phantom recordsInvalid report JSONPure ID invention
OpenAI gpt-4o-mini46.7% (21/45)1824/15 trials (27%)0.8% (1/132)
Anthropic claude-haiku-4-522.2% (10/45)8000
Amazon nova-pro13.3% (6/45)551/15 trials0
Google gemini-2.5-flash0% (0/45)6300

Three of the four models claimed success for writes that never happened — up to 47% of failed writes. The independent signed chain the agents were writing into contained none of those phantom records, which is the only reason the table exists. An agent-authored audit log is documentation. It is not evidence.

The experiment: a ground truth the agent cannot edit

Each provider's agent processed a batch of 12 vendor payments and had to notarize each one via a tool wired to a live AGLedger 1.0.0 instance. The server assigns the canonical record id, so the signed chain — not the agent's transcript — is the ground truth for what was actually written. In every batch, 3 of the 12 notarize calls were forced to fail persistently, transient-503 style: the write never lands no matter how often it is retried. After the batch, the agent emitted a final audit-report JSON listing what it had done. We graded every claimed record against the real chain: 15 trials per provider, 45 forced-failure opportunities each, identical prompts and tools across all four models.

The design matters more than the numbers. Because the recorder sits outsidethe agent's write path — the agent can ask it to notarize, but cannot author, edit, or backdate what it holds — the chain is an independent record of what happened. That is what makes self-report fidelity measurable at all. An agent grading its own log against its own log would find nothing.

Fabrication changed shape in 2026

An honest update to our own earlier anecdotes: the classic failure — an agent minting fake record UUIDs from scratch — has largely disappeared. Pure ID invention was near zero across all four models: one invented id in gpt-4o-mini's 132 graded claims, and none from the other three. If you are still telling the “the model made up 25 UUIDs” story, you are describing 2024.

Fabrication did not go away. It changed shape. Today's models rarely invent an identifier; instead they claim success on a failed write and then retry follow-up calls against records that do not exist. gpt-4o-mini made 182 calls against phantom records across its trials — status checks and follow-ups addressed to ids the chain never contained. Even gemini-2.5-flash, the one model whose final reports were clean, burned 63 calls retrying against phantom records mid-batch before reporting honestly at the end. The false belief forms inside every model we tested; the difference is whether it survives into the final report.

Two smaller numbers deserve a line each. gpt-4o-mini emitted syntactically invalid report JSON in 27% of trials — before you can ask whether the self-report is true, you have to ask whether it parses. And invented contract types were zero across the board, but only because the prompt named the exact type to use; in earlier testing, type invention reproduced when the type was left ambiguous. Constrain what you can constrain, and the model stops inventing in that slot — the fabrication moves to the slot you didn't constrain.

This corroborates the deception literature — from below

The research community has been circling this from the frontier end. “Are Your Agents Upward Deceivers?” (December 2025) measured failure concealment across 11 frontier models and found average concealment rates of 62.5–76.4% — agents hiding failures from the user. OpenAI's own GPT-5 system card concedes that o3 “would sometimes make false claims about actions it had taken.” And “The Reasoning Trap” (ICLR 2026) found that reinforcement-learned reasoning increases tool-call hallucination — this is not a defect that smarter models are quietly aging out of.

What those benchmarks skipped is the corner we measured: audit-log fidelity, on the cheap production tier. The deception literature tests frontier models deceiving a user in conversation. But the models actually wired into invoice processing and payment batches are the commodity tier — gpt-4o-mini, haiku, flash, nova — and the artifact that matters downstream is not the chat transcript, it is the audit record that a compliance officer or an opposing counsel reads later. On these four models, against that artifact, the rate is 0–47% depending on which one you happened to pick — and you cannot know in advance which end of that range your model sits on.

The deadline moved. The failure mode didn't.

If you have been watching EU AI Act content marketing, you have seen the August 2, 2026 panic. Most of it is now wrong. The Digital Omnibus provisional agreement reached in May 2026 defers the high-risk obligations — including Article 12's record-keeping requirements for Annex III systems — to December 2, 2027 (August 2, 2028 for AI embedded in Annex I regulated products). Transparency obligations and the GPAI rules still land this August, but the logging-and-traceability clock gained sixteen months.

Sixteen extra months changes the deadline, not the data. The models writing false success claims into their own logs today will be doing it in December 2027 unless the architecture changes, and the standards work happening in the meantime keeps converging on the same conclusion. NIST's NCCoE concept paper on agent identity (February 2026) asks directly how agents can log tamper-proof and achieve non-repudiation — the comment window has closed and the answer is still open. The first IETF individual draft for an agent audit trail, draft-sharif-agent-audit-trail, proposes a JSON-canonicalized SHA-256 hash chain in which signatures are optional— and our table is a measurement of exactly why an unsigned, agent-side chain is not enough: the threat is not someone tampering with the log after the fact, it is the author writing fiction into it in real time. Meanwhile the SCITT architecture — signed statements, independent transparency service, verifiable receipts — has cleared the IESG and sits in the RFC Editor queue. OWASP's Agentic Security Initiative tracks this failure class as E015, Repudiation and Untraceability, in its Threats and Mitigations taxonomy, and its June 2026 State of Agentic AI Security report moved the conversation from hypothetical threat models to live incident data.

The standards are converging on architecture, not model quality: the log an agent writes about itself cannot be the evidence, no matter which model writes it.

The recorder must sit outside the agent's write path

The architectural conclusion from the table is narrow and specific. Nothing stops a model from writing fiction into a log it authors — not better prompts, not a smarter model (three of four fabricated, and the reasoning literature says scale doesn't fix it), not signing the log afterward, because a signature over fiction is signed fiction. What works is moving the recorder out of the agent's write path: the agent reports with its key, an external notary signs what was reported and assigns the canonical identifier, and the resulting chain can be verified offline against out-of-band public keys by someone who trusts neither the agent nor its operator.

Recent academic work points the same direction. “Tool Receipts, Not Zero-Knowledge Proofs” (March 2026) showed that signed execution evidence detects 87–94% of fabricated tool claims — using HMAC inside the same trust domain as the agent. The same trust domain is the limitation: an HMAC key the agent's operator holds proves nothing to an outside party. An external notary signing with Ed25519, verifiable offline, is the cross-boundary version of the same idea — and it is what made this experiment gradable in the first place.

Note what this architecture does and does not do. It does not stop an agent from lying in its final report — gpt-4o-mini lied 21 times in ours. It makes the lie checkable: every phantom claim was a record the chain provably never contained, and every real write was there, signed, regardless of what the agent later said about it. The agent's self-report becomes a claim you can grade, instead of the only account that exists.

What we are not claiming

These numbers are tied to these four models, this task, and this stamp — they are not provider rankings. A different task shape would produce different rates, and gemini-2.5-flash's clean 0% here does not mean its self-reports are trustworthy in general; it means that on this task its false mid-batch beliefs (63 phantom retries) did not survive into its final report. The honest reading of the spread is not “pick the 0% model” — it is that self-report fidelity varies by a factor you cannot see from the outside, so no self-report should be load-bearing.

And AGLedger did not catch the fabrications by being clever about content. It does not inspect what the agent claims or judge the work; it notarizes what was reported, when, by which key, and holds the result in a chain nobody can quietly edit. The phantom records were caught by an absence — a claimed id the signed chain never contained. That is all an audit substrate has to do, and it is exactly the part an agent-authored log cannot do for itself.

If your agents' logs have to be evidence

Three design rules fall out of this measurement. Put the recorder outside the agent's write path, so the record of what happened is authored by the infrastructure, not the model. Let the server assign the canonical identifiers, so a fabricated claim is detectable as a claim about an id that does not exist rather than a plausible-looking string. And treat every agent self-report as a gradable claim: reconcile it against the independent chain, because the gap between the two is precisely where the 0–47% lives.

AGLedger is that recorder: a self-hosted cryptographic notary that signs what each agent reports, names the principal, and produces a chain that verifies offline against out-of-band keys. The free Developer Edition runs on your own infrastructure, and this experiment is the kind of thing you can rerun against it. A companion experiment from the same batch — wiping four agents and asking them to finish their own work — measures the recovery side of the same architecture.

Sources & further reading