
2026-06-05

Engineering

The 88,000-Token Crash: Making gpt-oss-120b Survive Its Full 128K Context on Strix Halo

By Michael Cooper · Founder

A 128K context window that allocates is not a 128K context window that works. This post is the gap between the two on a $2,300 Strix Halo box: a GPU crash that only appears past ~80,000 tokens of prefill, the kernel log that explains it, the flag matrix that isolates it, and the single llama.cpp setting that fixes it — with the throughput bill, measured.

The hardware and stack are covered in the build post (96 GB GMKtec EVO-X2, AMD Ryzen AI MAX+ 395 / Radeon 8060S iGPU, Ubuntu 26.04, llama.cpp with the Vulkan backend on stock Mesa/RADV) and reproducible step-by-step in the recipe.

Summary

With micro-batches of 1024 or 2048 tokens, any prompt past roughly 80,000–88,000 tokens kills the GPU mid-prefill: a single compute submission exceeds the amdgpu watchdog, the kernel resets the compute ring, the Vulkan device is lost, and llama-server core-dumps. Flash attention does not prevent it.

-ub 512 does. It is the only configuration we tested that completes a full ~110,000-token prefill, and it then held one hour of sustained 110k-token loads with zero device resets and perfect needle-in-a-haystack retrieval at every depth. The price: ~195 tok/s prefill at the long end, versus faster-but-crashing larger batches.

Memory was never the problem

The intuitive worry about running a 120B model at 128K context on a 96 GB box is memory. That worry is wrong on this model. Raising -c 32768 to -c 131072 moved the box from ~62 to ~65 GiB used: about 3 GiB for a 4× larger window, with ~24 GiB still free. gpt-oss uses grouped-query attention plus sliding-window attention on alternating layers, so its KV cache is dramatically cheaper than a full-attention model of the same size. The full native window allocates with room to spare.
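A back-of-the-envelope estimate makes that small delta plausible. The sketch below assumes a model shape of 36 layers, 8 KV heads, head dimension 64, a 128-token sliding window on every other layer, and an f16 KV cache. Those defaults are our assumptions to check against the gpt-oss-120b model card, and llama.cpp's real allocation differs in the details.

```python
# Rough KV-cache size estimate for a model that alternates full-attention
# and sliding-window layers. All model-shape defaults are assumptions
# (check them against the gpt-oss-120b model card), not measured values.

GIB = 2**30

def kv_cache_bytes(n_ctx, n_layers=36, n_kv_heads=8, head_dim=64,
                   window=128, bytes_per_elem=2):
    """Estimate KV-cache bytes for a context of n_ctx tokens."""
    # One token stores a K vector and a V vector per KV head, per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    full_layers = n_layers // 2            # layers that cache every token
    swa_layers = n_layers - full_layers    # layers that cache only the window
    full = full_layers * n_ctx * per_token_per_layer
    swa = swa_layers * min(n_ctx, window) * per_token_per_layer
    return full + swa

delta_gib = (kv_cache_bytes(131072) - kv_cache_bytes(32768)) / GIB
print(f"32K -> 128K KV-cache growth: {delta_gib:.2f} GiB")
```

Under these assumptions the growth comes out a little over 3 GiB, in line with the ~62 to ~65 GiB move we measured. Only the full-attention layers scale with the window; the sliding-window layers contribute almost nothing.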

So the window opens fine. The first conclusion we wrote down was “128K holds.” It allocates. Using it was another matter.

The crash: device-lost at ~88,000 tokens

We ran a context-fill curve — a trivial question appended after K tokens of deterministic filler — to measure prefill latency against length. On the as-shipped configuration (-ub 2048, no flash attention):

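| Case | Prompt tokens | Latency | Prefill tok/s |
|------|---------------|---------|---------------|
| 16k | 14,718 | 20.4 s | 722 |
| 32k | 29,337 | 33.5 s | 877 |
| 64k | 58,587 | 142.2 s | 412 |
| 96k | ~88,000 | crashed mid-prefill | |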

The kernel log tells the whole story in four lines:

amdgpu: ring comp_1.1.0 timeout, signaled seq=330115, emitted seq=330117
amdgpu: Starting comp_1.1.0 ring reset ... device wedged, but recovered through reset
llama-server: terminate called after throwing 'vk::DeviceLostError'
              what():  vk::Queue::submit: ErrorDeviceLost
systemd: llama-server.service: Failed with result 'core-dump' ... Scheduled restart

Read the sequence counters: signaled 330115, emitted 330117. The GPU was only two submissions behind. This is not a slowly degrading queue backing up. One individual compute submission stalled long enough to exceed the amdgpu watchdog, and the kernel did exactly what it is designed to do: reset the compute ring and recover the device. The recovery works and the GPU survives, but the Vulkan device handle is gone, llama-server throws and core-dumps, systemd restarts it (~45 s back to healthy), and the client gets a dropped connection partway through the prefill.
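If you want to watch for this signature in your own logs, a small parser is enough. This is a sketch that matches the timeout line exactly as it appears above; the wording can differ between kernel versions, so treat the regex as an assumption. The function name is ours.

```python
import re

# Pattern for the amdgpu ring-timeout line quoted above. The exact wording
# may vary by kernel version, so this regex is an assumption, not a spec.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, outstanding) or None if no match."""
    m = RING_TIMEOUT.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # outstanding = submissions emitted but not yet signaled as complete
    return m.group("ring"), signaled, emitted, emitted - signaled

line = "amdgpu: ring comp_1.1.0 timeout, signaled seq=330115, emitted seq=330117"
result = parse_ring_timeout(line)
print(result)
```

A small outstanding count, as here, points at one long-running submission rather than a deep backlog.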

The mechanism: submissions grow with context

Each prefill submission to the GPU covers -ub (micro-batch) tokens. The cost of that submission is not constant: every micro-batch attends over the entire KV cache built so far. At 16k context a 2048-token submission is quick; at 88k context, the same 2048-token submission carries attention over 88,000 cached tokens, and on an iGPU that single submission now runs long enough to look like a hang to the kernel's watchdog. It is not a hang. It just is not finished. The watchdog cannot tell the difference.
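A toy cost model shows the shape of the problem. It counts an upper bound on query-key pairs for one micro-batch on a full-attention layer and ignores everything else: causal masking inside the batch, the sliding-window layers, the feed-forward cost, and how the Vulkan backend actually splits work. Read it as an illustration of the scaling, not a timing prediction.

```python
def submission_work(cached_tokens, ubatch):
    """Upper bound on query-key pairs in one micro-batch.

    Each of the `ubatch` new tokens attends over the tokens already
    cached plus, at most, the whole current micro-batch.
    """
    return ubatch * (cached_tokens + ubatch)

# Same -ub 2048, deeper into the context: work per submission grows
# roughly linearly with the number of cached tokens.
growth = submission_work(88_000, 2048) / submission_work(16_000, 2048)

# Same depth, smaller micro-batch: work per submission shrinks with -ub.
shrink = submission_work(88_000, 512) / submission_work(88_000, 2048)

print(f"88k vs 16k at -ub 2048: {growth:.1f}x the work per submission")
print(f"-ub 512 vs -ub 2048 at 88k: {shrink:.2f}x the work per submission")
```

In this model, total attention work over the whole prompt barely changes with -ub. What changes is how much of it lands in any single submission, which is the quantity the watchdog cares about.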

That framing makes a prediction: shrink the submission, survive the window. So we tested one variable at a time.
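One way to keep such a sweep honest is to generate the server command lines from a single function, so each run differs only in the variable under test. The sketch below uses only the flags named in this post (-c, -ub, -fa) plus -m for the model; the model path is a placeholder, and expressing "no flash attention" by omitting -fa is an assumption that depends on your llama.cpp build's default.

```python
# The four configurations in the test matrix, as llama-server argument lists.
MODEL_PATH = "gpt-oss-120b.gguf"  # hypothetical path, adjust to your setup

def server_args(ubatch, flash_attention, n_ctx=131072):
    """Build an argument list that varies only -ub and -fa."""
    args = ["llama-server", "-m", MODEL_PATH,
            "-c", str(n_ctx), "-ub", str(ubatch)]
    if flash_attention:
        # Assumption: leaving the flag off means "no FA" on this build.
        args += ["-fa", "on"]
    return args

# (micro-batch, flash attention) pairs, one change per step.
MATRIX = [(2048, False), (2048, True), (1024, True), (512, True)]

for ubatch, fa in MATRIX:
    print(" ".join(server_args(ubatch, fa)))
```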

One variable at a time

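| Config (lengths tested: 64k, ~80k, ~88k, ~110k) | Outcomes recorded, in order of increasing length |
|------|------|
| -ub 2048, no FA (as shipped) | OK, DeviceLost |
| -ub 2048 + -fa on | DeviceLost |
| -ub 1024 + -fa on | DeviceLost |
| -ub 512 + -fa on | OK, OK, PASS |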

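Three findings out of that table:

Flash attention alone does not prevent the crash: -ub 2048 with -fa on still loses the device.

Halving the micro-batch is not enough: -ub 1024 with -fa on still loses the device.

-ub 512 with -fa on is the only configuration we tested that completes the full ~110,000-token prefill.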

The price of not crashing

Small submissions are stable and inefficient. The long-end numbers on the locked config (-ub 512 -fa on):

| Case | Prompt tokens | Latency | Prefill tok/s |
|------|---------------|---------|---------------|
| 96k | 87,821 | 536.6 s | 164 |
| 120k | 109,755 | 562.4 s | 195 |

Call it ~165–195 tok/s of prefill at the long end — a nine-and-a-half-minute wait to ingest 110k tokens. Larger micro-batches are faster right up until they wedge the GPU. On a server that exists to be used, the config that does not crash wins, and the rule generalizes: pick the configuration that survives the worst case you actually intend to serve, then optimize inside it.

Trust, then verify: needle-in-a-haystack at full window

Surviving the prefill is necessary, not sufficient — the model also has to be able to find things in a window that large. We planted one unique fact (MAUVE-MERIDIAN-7731) at 10%, 50%, and 90% depth of a ~110k-token document built from deterministic filler, then asked for it back (reasoning_effort=low):

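| Needle depth | Prompt tokens | Latency | Retrieved |
|--------------|---------------|---------|-----------|
| 10% | 109,795 | 1088 s | exact |
| 50% (lost-in-the-middle) | 109,795 | 1083 s | exact |
| 90% | 109,795 | 1093 s | exact |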

Three for three, including the 50% position that the lost-in-the-middle literature flags as the weak spot. And the part that matters as much as the retrieval: that table is ~1 hour of continuous ~110k-token prefills, back to back, with zero device-lost events. The stability fix holds under sustained load, not just a one-shot probe.
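For readers who want to run a similar probe, here is a minimal sketch of the document builder. The filler sentence, the needle wording, and the question are hypothetical stand-ins, not the exact text we used, and the line count is kept small for illustration; reaching ~110k tokens means scaling n_lines against your tokenizer.

```python
def build_haystack(n_lines, needle, depth):
    """Deterministic filler with one needle line inserted at depth (0..1)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    filler = [f"Record {i}: routine entry, nothing to report."
              for i in range(n_lines)]
    position = round(depth * n_lines)
    return "\n".join(filler[:position] + [needle] + filler[position:])

# Hypothetical phrasing around the fact planted in the test above.
NEEDLE = "The secret passphrase is MAUVE-MERIDIAN-7731."
QUESTION = "What is the secret passphrase? Answer with the passphrase only."

def is_exact(answer, expected="MAUVE-MERIDIAN-7731"):
    """True if the model's answer contains the planted fact verbatim."""
    return expected in answer

for depth in (0.10, 0.50, 0.90):
    doc = build_haystack(1000, NEEDLE, depth)
    prompt = doc + "\n\n" + QUESTION
    line_no = doc.split("\n").index(NEEDLE)
    print(f"depth {depth:.0%}: needle at line {line_no} of 1000")
```

Deterministic filler matters here: every run sends a prompt of the same length and content apart from the needle position, so latency differences between depths are not artifacts of the prompt.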

One honest footnote from those runs: sustained back-to-back long prefills settled at ~100–113 tok/s — roughly half what the single 120k case measured. The drop was consistent across all three runs, which reads as a sustained-load thermal/clock regime on the iGPU (GPU power level is auto) rather than variance. Single long prompts are faster than a batch of them. If you benchmark this hardware, report which regime you measured.
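The two regimes can be recomputed directly from the tables. This sketch divides prompt tokens by total request latency, which is an assumption: the latency includes whatever generation followed the prefill, so it slightly understates pure prefill speed.

```python
def prefill_tok_per_s(prompt_tokens, latency_s):
    """Tokens per second over the whole request (prefill plus a short reply)."""
    return prompt_tokens / latency_s

# Single 120k case from the locked-config table.
single = prefill_tok_per_s(109_755, 562.4)

# Three back-to-back needle runs at 10%, 50%, and 90% depth.
sustained = [prefill_tok_per_s(109_795, s) for s in (1088, 1083, 1093)]
mean_sustained = sum(sustained) / len(sustained)

ratio = mean_sustained / single
print(f"single run: {single:.0f} tok/s")
print(f"sustained:  {mean_sustained:.0f} tok/s ({ratio:.0%} of single)")
```

Reporting both numbers side by side, with the regime labeled, is what makes a benchmark on this hardware comparable to someone else's.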

Open threads

Why we care about the long tail

A 110,000-token prompt is not a synthetic curiosity. It is an agent reading a contract set, a codebase slice, a month of tickets — the workloads a local model exists to serve, because they are exactly the ones whose data should not leave the building. And those workloads hit the long tail of the context window routinely, which is why a crash that only manifests past 80k tokens matters: it is invisible in every quick test and guaranteed in production.

This box is the local-agent testbed for AGLedger, the cryptographic notary we make for automated work. The failure mode in this post is a concrete example of why that product exists: the client saw the crash as a dropped connection partway through a long prefill — the work simply vanished. An agent that runs on your own hardware has no provider logs and no vendor audit trail; if you want a signed, tamper-evident record of what it intended, what it did, and where it stopped, you have to make one. AGLedger Developer Edition runs on the same no-cloud terms as everything else on this box — offline, bundled PostgreSQL, no phone home, free and fully unlocked: start here.

Next in this series: what a local 120B model can actually do as an agent, covering tool-calling reliability, multi-turn round-trips, and the failure modes that matter more than tokens per second.

Sources & further reading

llama.cpp on GitHub — the inference server (Vulkan backend)

gpt-oss-120b GGUF (ggml-org) on Hugging Face

gpt-oss-120b model card on Hugging Face — GQA + sliding-window attention details

Linux kernel amdgpu driver documentation

Linux kernel DRM documentation — device reset and robustness semantics

arXiv 2205.14135 — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

arXiv 2307.03172 — Lost in the Middle: How Language Models Use Long Contexts
