2026-06-05
EngineeringThe 88,000-Token Crash: Making gpt-oss-120b Survive Its Full 128K Context on Strix Halo
By Michael Cooper · Founder
A 128K context window that allocates is not a 128K context window that works. This post is the gap between the two on a $2,300 Strix Halo box: a GPU crash that only appears past ~80,000 tokens of prefill, the kernel log that explains it, the flag matrix that isolates it, and the single llama.cpp setting that fixes it — with the throughput bill, measured.
The hardware and stack are covered in the build post (96 GB GMKtec EVO-X2, AMD Ryzen AI MAX+ 395 / Radeon 8060S iGPU, Ubuntu 26.04, llama.cpp with the Vulkan backend on stock Mesa/RADV) and reproducible step-by-step in the recipe.
Summary
With micro-batches of 1024 or 2048 tokens, any prompt past roughly 80,000–88,000 tokens kills the GPU mid-prefill: a single compute submission exceeds the amdgpu watchdog, the kernel resets the compute ring, the Vulkan device is lost, and llama-server core-dumps. Flash attention does not prevent it.
-ub 512 does. It is the only configuration we tested that completes a full ~110,000-token prefill, and it then held one hour of sustained 110k-token loads with zero device resets and perfect needle-in-a-haystack retrieval at every depth. The price: ~195 tok/s prefill at the long end, versus faster-but-crashing larger batches.
Memory was never the problem
The intuitive worry about running a 120B model at 128K context on a 96 GB box is memory. That worry is wrong on this model. Raising -c 32768 to -c 131072moved the box from ~62 to ~65 GiB used — about 3 GiB for a 4× larger window, with ~24 GiB still free. gpt-oss uses grouped-query attention plus sliding-window attention on alternating layers, so its KV cache is dramatically cheaper than a full-attention model of the same size. The full native window allocates with room to spare.
So the window opens fine. The first conclusion we wrote down was “128K holds.” It allocates. Using it was another matter.
The crash: device-lost at ~88,000 tokens
We ran a context-fill curve — a trivial question appended after K tokens of deterministic filler — to measure prefill latency against length. On the as-shipped configuration (-ub 2048, no flash attention):
| Case | Prompt tokens | Latency | Prefill tok/s |
|---|---|---|---|
| 16k | 14,718 | 20.4 s | 722 |
| 32k | 29,337 | 33.5 s | 877 |
| 64k | 58,587 | 142.2 s | 412 |
| 96k | ~88,000 | crashed mid-prefill | — |
The kernel log tells the whole story in four lines:
amdgpu: ring comp_1.1.0 timeout, signaled seq=330115, emitted seq=330117
amdgpu: Starting comp_1.1.0 ring reset ... device wedged, but recovered through reset
llama-server: terminate called after throwing 'vk::DeviceLostError'
what(): vk::Queue::submit: ErrorDeviceLost
systemd: llama-server.service: Failed with result 'core-dump' ... Scheduled restartRead the sequence counters: signaled 330115, emitted 330117. The GPU was two submissions from caught-up. This is not a slowly-degrading queue backing up — one individual compute submission stalled long enough to exceed the amdgpu watchdog, and the kernel did exactly what it is designed to do: reset the compute ring and recover the device. The recovery works — the GPU survives — but the Vulkan device handle is gone, llama-server throws and core-dumps, systemd restarts it (~45 s back to healthy), and the client gets a dropped connection partway through the prefill.
The mechanism: submissions grow with context
Each prefill submission to the GPU covers -ub(micro-batch) tokens. The cost of that submission is not constant: every micro-batch attends over the entire KV cache built so far. At 16k context a 2048-token submission is quick; at 88k context, the same 2048-token submission carries attention over 88,000 cached tokens — on an iGPU, that single submission now runs long enough to look like a hang to the kernel's watchdog. It is not a hang. It just is not finished. The watchdog cannot tell the difference.
That framing makes a prediction: shrink the submission, survive the window. So we tested one variable at a time.
One variable at a time
| Config | 64k | ~80k | ~88k | ~110k |
|---|---|---|---|---|
| -ub 2048, no FA (as shipped) | OK | — | DeviceLost | — |
| -ub 2048 + -fa on | — | — | DeviceLost | — |
| -ub 1024 + -fa on | — | DeviceLost | — | — |
| -ub 512 + -fa on | — | OK | OK | PASS |
Three findings out of that table:
- Flash attention did not fix the crash.
-fa onwith-ub 2048died at the same ~88k. FA reduces memory traffic and helps throughput, but the watchdog math is about wall-clock per submission, and at this scale FA alone does not bring a 2048-token submission under the limit. (We keep it on anyway — it is otherwise sensible.) - -ub 1024 is not a compromise; it is a slower way to crash. It died earlier(~80k), and its throughput advantage evaporates as context grows — ~360 tok/s at 55k decaying to ~155 tok/s by 80k, converging on what -ub 512 delivers anyway. At long context you pay the stability penalty without keeping the speed.
- -ub 512 completed the full window. 109,755 tokens of prefill, no device-lost, answer returned.
The price of not crashing
Small submissions are stable and inefficient. The long-end numbers on the locked config (-ub 512 -fa on):
| Case | Prompt tokens | Latency | Prefill tok/s |
|---|---|---|---|
| 96k | 87,821 | 536.6 s | 164 |
| 120k | 109,755 | 562.4 s | 195 |
Call it ~165–195 tok/s of prefill at the long end — a nine-and-a-half-minute wait to ingest 110k tokens. Larger micro-batches are faster right up until they wedge the GPU. On a server that exists to be used, the config that does not crash wins, and the rule generalizes: pick the configuration that survives the worst case you actually intend to serve, then optimize inside it.
Trust, then verify: needle-in-a-haystack at full window
Surviving the prefill is necessary, not sufficient — the model also has to be able to find things in a window that large. We planted one unique fact (MAUVE-MERIDIAN-7731) at 10%, 50%, and 90% depth of a ~110k-token document built from deterministic filler, then asked for it back (reasoning_effort=low):
| Needle depth | Prompt tokens | Latency | Retrieved |
|---|---|---|---|
| 10% | 109,795 | 1088 s | exact |
| 50% (lost-in-the-middle) | 109,795 | 1083 s | exact |
| 90% | 109,795 | 1093 s | exact |
Three for three, including the 50% position that the lost-in-the-middle literature flags as the weak spot. And the part that matters as much as the retrieval: that table is ~1 hour of continuous ~110k-token prefills, back to back, with zero device-lost events. The stability fix holds under sustained load, not just a one-shot probe.
One honest footnote from those runs: sustained back-to-back long prefills settled at ~100–113 tok/s — roughly half what the single 120k case measured. The drop was consistent across all three runs, which reads as a sustained-load thermal/clock regime on the iGPU (GPU power level is auto) rather than variance. Single long prompts are faster than a batch of them. If you benchmark this hardware, report which regime you measured.
Open threads
- A sweet spot between 512 and 1024? Possibly —
-ub 640or768might be stable and faster. But 1024 already crashes at ~80k, the margin is thin, and the failure mode is a wedged GPU on a production box. We stopped at the config that survives. - Raising the watchdog timeout instead.
amdgpu.lockup_timeoutis the other lever. We chose not to: it masks the symptom for whatever submission size you test today and removes the kernel's ability to catch a real hang. Fixing the submission size addresses the cause.
Why we care about the long tail
A 110,000-token prompt is not a synthetic curiosity. It is an agent reading a contract set, a codebase slice, a month of tickets — the workloads a local model exists to serve, because they are exactly the ones whose data should not leave the building. And those workloads hit the long tail of the context window routinely, which is why a crash that only manifests past 80k tokens matters: it is invisible in every quick test and guaranteed in production.
This box is the local-agent testbed for AGLedger, the cryptographic notary we make for automated work. The failure mode in this post is a concrete example of why that product exists: the client saw the crash as a dropped connection partway through a long prefill — the work simply vanished. An agent that runs on your own hardware has no provider logs and no vendor audit trail; if you want a signed, tamper-evident record of what it intended, what it did, and where it stopped, you have to make one. AGLedger Developer Edition runs on the same no-cloud terms as everything else on this box — offline, bundled PostgreSQL, no phone home, free and fully unlocked: start here.
Next in this series: what a local 120B model can actually do as an agent— tool-calling reliability, multi-turn round-trips, and the failure modes that matter more than tokens per second.
Sources & further reading
llama.cpp on GitHub — the inference server (Vulkan backend)
gpt-oss-120b GGUF (ggml-org) on Hugging Face
gpt-oss-120b model card on Hugging Face — GQA + sliding-window attention details
Linux kernel amdgpu driver documentation
Linux kernel DRM documentation — device reset and robustness semantics
arXiv 2205.14135 — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
arXiv 2307.03172 — Lost in the Middle: How Language Models Use Long Contexts