Why does llama.cpp crash with vk::DeviceLostError on long prompts on AMD iGPUs?

Each prefill compute submission covers -ub (micro-batch) tokens. At high context, attention over the large KV cache makes a single submission expensive enough to exceed the amdgpu compute-ring watchdog timeout. The kernel resets the ring to recover the GPU, the Vulkan device is lost, and llama-server terminates with vk::DeviceLostError. On a 96 GB Strix Halo box this reproduced at roughly 88,000 tokens of prefill with -ub 2048 and roughly 80,000 with -ub 1024.

Does flash attention fix the amdgpu ring timeout crash at long context?

No. With -fa on and -ub 2048 the crash still occurred at roughly 88,000 tokens of prefill on a Strix Halo (gfx1151) box. The fix is reducing the micro-batch size: -ub 512 was the only tested configuration that completed a full ~110,000-token prefill.

What does the -ub flag do in llama.cpp and what should it be on Strix Halo?

-ub sets the micro-batch size: how many tokens each compute submission to the GPU covers during prefill. Larger values are faster at short context but make individual GPU submissions longer. On Strix Halo at long context, -ub 512 is the stability ceiling: -ub 1024 and -ub 2048 both trip the GPU watchdog past ~80,000-88,000 tokens. Use -ub 512 if you intend to use context windows beyond roughly 64k tokens.

Can gpt-oss-120b really use its full 131,072-token context window on a 96 GB Strix Halo box?

Yes. Memory is not the constraint — the model sits at ~65 GiB of 89 GiB at full context because gpt-oss uses GQA and sliding-window attention, so 128K costs only ~3 GiB more than 32K. With -ub 512 and flash attention, ~110,000-token prefills complete reliably (~195 tok/s prefill for a single prompt; ~100-113 tok/s sustained across back-to-back runs), and a needle-in-a-haystack fact planted at 10, 50, and 90 percent depth of a ~110k-token document was retrieved exactly at all three depths, including the lost-in-the-middle position.

2026-06-05

Engineering

The 88,000-Token Crash: Making gpt-oss-120b Survive Its Full 128K Context on Strix Halo

By Michael Cooper · Founder

A 128K context window that allocates is not a 128K context window that works. This post is the gap between the two on a $2,300 Strix Halo box: a GPU crash that only appears past ~80,000 tokens of prefill, the kernel log that explains it, the flag matrix that isolates it, and the single llama.cpp setting that fixes it — with the throughput bill, measured.

The hardware and stack are covered in the build post (96 GB GMKtec EVO-X2, AMD Ryzen AI MAX+ 395 / Radeon 8060S iGPU, Ubuntu 26.04, llama.cpp with the Vulkan backend on stock Mesa/RADV) and reproducible step-by-step in the recipe.

Summary

With micro-batches of 1024 or 2048 tokens, any prompt past roughly 80,000–88,000 tokens kills the GPU mid-prefill: a single compute submission exceeds the amdgpu watchdog, the kernel resets the compute ring, the Vulkan device is lost, and llama-server core-dumps. Flash attention does not prevent it.

-ub 512 does. It is the only configuration we tested that completes a full ~110,000-token prefill, and it then held one hour of sustained 110k-token loads with zero device resets and perfect needle-in-a-haystack retrieval at every depth. The price: ~195 tok/s prefill at the long end, versus faster-but-crashing larger batches.

Memory was never the problem

The intuitive worry about running a 120B model at 128K context on a 96 GB box is memory. For this model, that worry is misplaced. Raising -c 32768 to -c 131072 moved the box from ~62 to ~65 GiB used: about 3 GiB for a 4× larger window, with ~24 GiB still free. gpt-oss uses grouped-query attention plus sliding-window attention on alternating layers, so its KV cache is dramatically cheaper than a full-attention model of the same size. The full native window allocates with room to spare.

So the window opens fine. The first conclusion we wrote down was “128K holds.” It allocates. Using it was another matter.

The crash: device-lost at ~88,000 tokens

We ran a context-fill curve — a trivial question appended after K tokens of deterministic filler — to measure prefill latency against length. On the as-shipped configuration (-ub 2048, no flash attention):

Case	Prompt tokens	Latency	Prefill tok/s
16k	14,718	20.4 s	722
32k	29,337	33.5 s	877
64k	58,587	142.2 s	412
96k	~88,000	crashed mid-prefill	—

The kernel log tells the whole story in four entries:

amdgpu: ring comp_1.1.0 timeout, signaled seq=330115, emitted seq=330117
amdgpu: Starting comp_1.1.0 ring reset ... device wedged, but recovered through reset
llama-server: terminate called after throwing 'vk::DeviceLostError'
              what():  vk::Queue::submit: ErrorDeviceLost
systemd: llama-server.service: Failed with result 'core-dump' ... Scheduled restart

Read the sequence counters: signaled 330115, emitted 330117. The GPU was two submissions from caught-up. This is not a slowly-degrading queue backing up — one individual compute submission stalled long enough to exceed the amdgpu watchdog, and the kernel did exactly what it is designed to do: reset the compute ring and recover the device. The recovery works — the GPU survives — but the Vulkan device handle is gone, llama-server throws and core-dumps, systemd restarts it (~45 s back to healthy), and the client gets a dropped connection partway through the prefill.
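For anyone triaging the same failure from journal output, here is a minimal Python sketch that pulls the two counters out of the timeout line. The regular expression is written against the single line quoted above, so its wording is an assumption; other kernel versions may format the message differently. parse_ring_timeout is a hypothetical helper name.

```python
import re

# Pattern written against the one log line quoted in this post; treat the
# exact wording as an assumption, since kernel versions may differ.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeout(line):
    """Return (ring, signaled, emitted) for a ring-timeout line, else None."""
    match = RING_TIMEOUT.search(line)
    if match is None:
        return None
    return match["ring"], int(match["signaled"]), int(match["emitted"])

line = "amdgpu: ring comp_1.1.0 timeout, signaled seq=330115, emitted seq=330117"
ring, signaled, emitted = parse_ring_timeout(line)
# A small gap means the queue was nearly drained when the watchdog fired.
print(ring, emitted - signaled)  # comp_1.1.0 2
```

A gap of two is what separates this failure from a queue that is steadily falling behind, which would show a much larger difference between the counters.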

The mechanism: submissions grow with context

Each prefill submission to the GPU covers -ub (micro-batch) tokens. The cost of that submission is not constant: every micro-batch attends over the entire KV cache built so far. At 16k context a 2048-token submission is quick; at 88k context, the same 2048-token submission carries attention over 88,000 cached tokens. On an iGPU, that single submission now runs long enough to look like a hang to the kernel's watchdog. It is not a hang. It just is not finished. The watchdog cannot tell the difference.
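A toy model of that growth, offered as a sketch rather than a measurement: assume the attention work in one submission scales with the micro-batch size times the number of cached tokens it attends over. This deliberately ignores the rest of the forward pass and the sliding-window layers, so the ratios are illustrative only. relative_attention_work is a hypothetical name.

```python
def relative_attention_work(ub, cached_tokens):
    """Toy proxy: query tokens in the submission times keys attended over."""
    return ub * cached_tokens

base = relative_attention_work(2048, 16_000)

# The same -ub 2048 submission, much later in a long prefill.
late = relative_attention_work(2048, 88_000)
print(late / base)   # 5.5

# Shrinking the micro-batch at the same depth.
small = relative_attention_work(512, 88_000)
print(small / late)  # 0.25
```

Under this proxy, a submission at 88k depth carries 5.5 times the attention work of the same submission at 16k, and dropping to -ub 512 cuts it to a quarter. It says nothing about where the watchdog limit sits in wall-clock terms.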

That framing makes a prediction: shrink the submission, survive the window. So we tested one variable at a time.

One variable at a time

Config	64k	~80k	~88k	~110k
-ub 2048, no FA (as shipped)	OK	—	DeviceLost	—
-ub 2048 + -fa on	—	—	DeviceLost	—
-ub 1024 + -fa on	—	DeviceLost	—	—
-ub 512 + -fa on	—	OK	OK	PASS

Three findings out of that table:

Flash attention did not fix the crash. -fa on with -ub 2048 died at the same ~88k. FA reduces memory traffic and helps throughput, but the watchdog math is about wall-clock per submission, and at this scale FA alone does not bring a 2048-token submission under the limit. (We keep it on anyway; it is otherwise sensible.)
-ub 1024 is not a compromise; it is a slower way to crash. It died earlier (~80k), and its throughput advantage evaporates as context grows: ~360 tok/s at 55k decaying to ~155 tok/s by 80k, converging on what -ub 512 delivers anyway. At long context you pay the stability penalty without keeping the speed.
-ub 512 completed the full window. 109,755 tokens of prefill, no device-lost, answer returned.
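For reference, the surviving row of that table can be expressed as a launch command, sketched here in Python. Only -c, -ub and -fa come from this post; the binary being on PATH and the model path are placeholders, and build_server_args is a hypothetical helper.

```python
def build_server_args(model_path, ctx=131072, ubatch=512, flash_attn=True):
    """Build an argv list for llama-server using the flags discussed here."""
    return [
        "llama-server",
        "-m", model_path,            # placeholder path, not from the post
        "-c", str(ctx),              # context window
        "-ub", str(ubatch),          # micro-batch size per compute submission
        "-fa", "on" if flash_attn else "off",
    ]

args = build_server_args("/path/to/gpt-oss-120b.gguf")
print(" ".join(args))
```

Keeping the micro-batch as an explicit argument makes it easy to rerun the same matrix with one variable changed at a time.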

The price of not crashing

Small submissions are stable and inefficient. The long-end numbers on the locked config (-ub 512 -fa on):

Case	Prompt tokens	Latency	Prefill tok/s
96k	87,821	536.6 s	164
120k	109,755	562.4 s	195

Call it ~165–195 tok/s of prefill at the long end — a nine-and-a-half-minute wait to ingest 110k tokens. Larger micro-batches are faster right up until they wedge the GPU. On a server that exists to be used, the config that does not crash wins, and the rule generalizes: pick the configuration that survives the worst case you actually intend to serve, then optimize inside it.
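The arithmetic behind those figures is easy to check. The sketch below assumes the tok/s column is simply prompt tokens divided by total request latency, which is consistent with the numbers in the table; prefill_rate is a hypothetical helper name.

```python
def prefill_rate(prompt_tokens, latency_s):
    """Tokens per second over the whole request latency."""
    return prompt_tokens / latency_s

# (case, prompt tokens, latency in seconds) from the table above
rows = [("96k", 87_821, 536.6), ("120k", 109_755, 562.4)]

for case, tokens, latency in rows:
    print(case, round(prefill_rate(tokens, latency)), f"{latency / 60:.1f} min")
# 96k 164 8.9 min
# 120k 195 9.4 min
```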

Trust, then verify: needle-in-a-haystack at full window

Surviving the prefill is necessary, not sufficient — the model also has to be able to find things in a window that large. We planted one unique fact (MAUVE-MERIDIAN-7731) at 10%, 50%, and 90% depth of a ~110k-token document built from deterministic filler, then asked for it back (reasoning_effort=low):

Needle depth	Prompt tokens	Latency	Retrieved
10%	109,795	1088 s	exact
50% (lost-in-the-middle)	109,795	1083 s	exact
90%	109,795	1093 s	exact

Three for three, including the 50% position that the lost-in-the-middle literature flags as the weak spot. And the part that matters as much as the retrieval: that table is ~1 hour of continuous ~110k-token prefills, back to back, with zero device-lost events. The stability fix holds under sustained load, not just a one-shot probe.
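The shape of that test is simple enough to sketch. This is not the harness used for the runs above: the filler text, the needle sentence wording, and the build_haystack helper are all hypothetical, and depth here is measured in filler sentences rather than tokens.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` after roughly `depth` (0.0 to 1.0) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    cut = round(len(filler_sentences) * depth)
    return " ".join(filler_sentences[:cut] + [needle] + filler_sentences[cut:])

# Deterministic filler: the same input always yields the same document.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The access phrase is MAUVE-MERIDIAN-7731."

for depth in (0.10, 0.50, 0.90):
    doc = build_haystack(filler, needle, depth)
    # Report where the needle actually landed, as a fraction of the document.
    print(depth, round(doc.index(needle) / len(doc), 2))
```

Because the filler is deterministic, a failed retrieval can be replayed against the identical document rather than a freshly randomized one.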

One honest footnote from those runs: sustained back-to-back long prefills settled at ~100–113 tok/s — roughly half what the single 120k case measured. The drop was consistent across all three runs, which reads as a sustained-load thermal/clock regime on the iGPU (GPU power level is auto) rather than variance. Single long prompts are faster than a batch of them. If you benchmark this hardware, report which regime you measured.

Open threads

A sweet spot between 512 and 1024? Possibly — -ub 640 or 768 might be stable and faster. But 1024 already crashes at ~80k, the margin is thin, and the failure mode is a wedged GPU on a production box. We stopped at the config that survives.
Raising the watchdog timeout instead. amdgpu.lockup_timeout is the other lever. We chose not to: it masks the symptom for whatever submission size you test today and removes the kernel's ability to catch a real hang. Fixing the submission size addresses the cause.
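If you want to see what the watchdog is currently set to before deciding, a read-only sketch follows. The sysfs path is the usual location for kernel module parameters and is an assumption here rather than something verified in this post; the snippet returns the raw string without interpreting it and changes nothing. read_lockup_timeout is a hypothetical helper name.

```python
from pathlib import Path

# Assumed sysfs location for the module parameter; adjust if your kernel differs.
LOCKUP_TIMEOUT = Path("/sys/module/amdgpu/parameters/lockup_timeout")

def read_lockup_timeout(path=LOCKUP_TIMEOUT):
    """Return the raw parameter string, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None

print(read_lockup_timeout())
```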

Why we care about the long tail

A 110,000-token prompt is not a synthetic curiosity. It is an agent reading a contract set, a codebase slice, a month of tickets — the workloads a local model exists to serve, because they are exactly the ones whose data should not leave the building. And those workloads hit the long tail of the context window routinely, which is why a crash that only manifests past 80k tokens matters: it is invisible in every quick test and guaranteed in production.

This box is the local-agent testbed for AGLedger, the cryptographic notary we make for automated work. The failure mode in this post is a concrete example of why that product exists: the client saw the crash as a dropped connection partway through a long prefill — the work simply vanished. An agent that runs on your own hardware has no provider logs and no vendor audit trail; if you want a signed, tamper-evident record of what it intended, what it did, and where it stopped, you have to make one. AGLedger Developer Edition runs on the same no-cloud terms as everything else on this box — offline, bundled PostgreSQL, no phone home, free and fully unlocked: start here.

Next in this series: what a local 120B model can actually do as an agent, covering tool-calling reliability, multi-turn round-trips, and the failure modes that matter more than tokens per second.