What is the best way to run gpt-oss-120b on Strix Halo (gfx1151) in 2026?

llama.cpp built with the Vulkan backend (GGML_VULKAN=ON) on stock Mesa/RADV. On Ubuntu 26.04 the stock Mesa 26.0.3 already identifies the iGPU as RADV STRIX_HALO — no ROCm install, no PPA, no HSA_OVERRIDE_GFX_VERSION. We previously ran Ollama with its bundled ROCm runtime; it worked at short context but produced a documented output-corruption failure mode on gfx1151 (repeating characters after several turns), and the llama.cpp/Vulkan stack is also faster: ~48 vs ~35 tok/s generation on the same box.

What kernel parameters are required to load large LLMs on Strix Halo?

On the llama.cpp + Vulkan stack: ttm.pages_limit=20971520 and ttm.page_pool_size=20971520 (an 80 GiB GTT allocation cap), plus the BIOS UMA Frame Buffer reduced to its minimum (2 GB on the GMKtec EVO-X2). amd_iommu=off is set for performance — it removes IOMMU translation overhead — but is not required to load the model. amdgpu.no_system_mem_limit=1, which an earlier version of this post called required, is only needed on the ROCm SVM allocation path — Vulkan/RADV never hits that cap. The non-kernel gotcha: the user running llama.cpp must be in the render and video groups or Vulkan silently falls back to CPU llvmpipe.

How many tokens per second does gpt-oss-120b run at on Strix Halo?

About 48 tokens per second generation in interactive multi-turn use, on llama.cpp with the Vulkan backend at full GPU offload. Prefill reaches ~877 tok/s at 32K context with large micro-batches, but long-context stability requires -ub 512, which holds ~165-195 tok/s prefill at 88-110K tokens.

Is 96 GB of RAM enough for gpt-oss-120b, or do I need the 128 GB Strix Halo?

96 GB is enough, including at the full native 131,072-token context window. The MXFP4 weights are ~63 GB on disk and the box sits at ~65 GiB used of 89 GiB available with the model resident at 128K context — gpt-oss’s GQA and sliding-window attention keep the KV cache small, so going from 32K to 128K context costs only ~3 GiB.

Can gpt-oss-120b actually use its full 128K context window on Strix Halo?

Yes, but only with the right micro-batch size. With -ub 1024 or larger, prefills past roughly 80,000 tokens trip the amdgpu GPU watchdog: ring reset, vk::DeviceLostError, server crash. With -ub 512 plus flash attention, full ~110K-token prefills complete reliably, and a needle-in-a-haystack fact planted at 10/50/90 percent depth of a ~110K-token document was retrieved exactly at all three depths.

← Blog

2026-05-06 · rewritten 2026-06-05 after a full rebuild

Engineering

Near Frontier-Quality LLM, No Cloud, No Subscription, Unlimited Tokens: gpt-oss-120b on Strix Halo + Ubuntu 26.04

By Michael Cooper · Founder

We paid $2,300 for a 96 GB GMKtec EVO-X2. The hardware ships with Windows; we wiped it and installed Ubuntu 26.04 LTS — kernel 7.0, and the first Ubuntu LTS that ships native Strix Halo (gfx1151) support in the stack. Configured the way this post describes, the box runs OpenAI's gpt-oss-120b at ~48 tokens per second, serves an OpenAI-compatible API on the LAN, and holds the model's fullnative 131,072-token context window stably — including ~110,000-token prompts.

We did not buy the 128 GB version because the 96 GB version is enough: ~65 GiB of 89 GiB in use with the model resident at full context, headroom to spare.

This post was rewritten on 2026-06-05

The original version (2026-05-06) documented an Ollama + ROCm + agent-gateway stack at 35 tok/s. We have since wiped the box and rebuilt it on llama.cpp + Vulkan — faster, more stable, and simpler — after the Ollama/ROCm build developed a documented gfx1151 output-corruption failure mode. The section “What the May version of this post got wrong” below covers exactly which claims changed and why. This is a test box; we iterate by rebuilding on current best practices, and the posts get rebuilt with it.

Want the recipe, not the story?

The full step-by-step setup procedure, with expected output and known failure modes for every step, is at /recipes/local-llm-strix-halo-ubuntu-26-04 — with HowTo schema for AI assistants. Plain markdown also at /recipes/local-llm-strix-halo-ubuntu-26-04.md. One curl, no DOM, license CC0.

Summary

The verified stack: Ubuntu 26.04, stock Mesa 26.0.3 (RADV), llama.cpp built with GGML_VULKAN=ON, gpt-oss-120b as a 3-part MXFP4 GGUF (~63 GB), run as a systemd service. No ROCm anywhere in the stack.

What it takes: the BIOS UMA Frame Buffer at its 2 GB minimum, two TTM kernel parameters (ttm.pages_limit + ttm.page_pool_size = 80 GiB GTT), your user in the render/videogroups — and one llama.cpp flag, -ub 512, without which any prompt past ~80k tokens crashes the GPU. That last one gets its own post.

What the May version of this post got wrong

The May build worked, and everything it reported was measured and true on that stack. Then, in sustained multi-turn use, it developed the failure mode the local-LLM forums had already documented for Ollama-on-gfx1151: output corruption — repeating characters after several conversation turns. We wiped the box and rebuilt from fresh research instead of patching. The rebuild changed more than the inference server:

May build (original post)	June rebuild (this post)
Ollama 0.22.1, bundled ROCm runtime	llama.cpp (Vulkan backend), stock Mesa/RADV
~35 tok/s generation	~48 tok/s generation
“Kernel parameter trio,” including `amdgpu.no_system_mem_limit=1`	Two TTM parameters + BIOS UMA. `no_system_mem_limit` was a ROCm-SVM workaround; Vulkan never hits that cap.
128K context configured, lightly exercised	128K context verified under load: ~110k-token prefills, needle-in-a-haystack 3/3, one hour sustained — after finding and fixing a GPU-watchdog crash
Output corruption after several turns (the reason for the rebuild)	Does not reproduce on the llama.cpp/Vulkan stack
Agent gateway (OpenClaw) baked into the recipe	Recipe ends at a clean OpenAI-compatible endpoint; the agent layer is your call

The general lesson we keep relearning on this hardware: the Strix Halo software stack is moving fast enough that any guide — including ours — is a snapshot. Every claim in this version was verified on 2026-06-02 against the live box, and the stack versions are stamped in the recipe.

The part the benchmarks do not capture: it is in stock

The most under-discussed thing about local LLMs is availability. The hardware that can fit a ~60 GB model into GPU memory, in the price range a solo developer or a small team would actually spend, is mostly not shipping. Prices and lead times as of May 2026, when we bought:

Option	Approximate price	Lead time, May 2026
GMKtec EVO-X2 96 GB / 2 TB	$2,300 (we paid)	In stock, 4 days door-to-door
Apple Mac Studio M3 Ultra, 96 GB	From $4,000	3-6 week back-order
Apple Mac Studio M3 Ultra, 128-256 GB	$6,000 - $14,099	Largely unavailable; supply-chain delays into Q4
Dual RTX 4090 workstation build	$5,000+	Used 4090 supply thin; build time non-zero
NVIDIA GH200	Six figures	Enterprise channel, months

Available hardware that fits a near-frontier open-weights model, at a price a solo developer or a small team can absorb, is the news; the kernel knobs are how you make it work.

Why Strix Halo specifically

Unified memory. The iGPU uses system RAM directly through the kernel TTM/GTT subsystem: 96 GB of LPDDR5X at roughly 256 GB/s, one pool, no separate VRAM, no PCIe copies between CPU and GPU memory. A discrete-GPU configuration that fits a ~60 GB model in VRAM needs at least three RTX 4090s plus a workstation chassis to match the capacity.

Vulkan on stock Mesa. The iGPU is gfx1151, and Ubuntu 26.04's stock Mesa 26.0.3 already identifies it as RADV STRIX_HALO. llama.cpp built with GGML_VULKAN=ONruns full GPU offload on that driver alone — no ROCm install, no HSA_OVERRIDE_GFX_VERSION, no PPA, no mesa-git. The dependency surface is the Ubuntu archive.

Architectural fit. gpt-oss-120b is Mixture-of-Experts: ~117 billion total parameters, only ~5 billion active per token. The full model fits in unified capacity; per-token throughput only re-reads the active path, so LPDDR5X bandwidth is enough. Under the May build we measured the contrast directly: llama3.3:70b, a dense model where every token re-reads every weight, managed 5.1 tok/s on this box — right at the bandwidth-divided-by-size ceiling — while the bigger-on-paper 120B MoE ran 7× faster. On a bandwidth-bound chip, architecture matters more than parameter count.

What it actually takes on the current stack

Full GPU offload is still all-or-nothing: a transformer with layers spilled to the CPU puts host-memory bandwidth on the critical path of every token, and throughput collapses by two orders of magnitude (the May build measured 0.27 tok/s misconfigured vs 35.5 configured — a 130× cliff, and the same cliff exists on this stack). Three host-side settings keep the model on the GPU:

1. BIOS UMA Frame Buffer at its minimum — 2 GB on the EVO-X2. On a unified-memory APU a big fixed VRAM carve-out only shrinks the GTT pool everything actually runs in. With 2 GB reserved, Linux sees 89 GiB.

2. ttm.pages_limit=20971520 ttm.page_pool_size=20971520 — in GRUB, an 80 GiB GTT allocation cap (the kernel does not raise it automatically), plus amd_iommu=off on a single-tenant box. Verify with dmesg: 81920M of GTT memory ready.

3. render + video group membership — the non-obvious one. Without it, Vulkan silently falls back to llvmpipe (CPU) and llama.cpp sees no GPU at all. That includes the systemd service user: SupplementaryGroups=render video.

What is not on the list anymore: amdgpu.no_system_mem_limit=1. The May version of this post called it one of three required parameters, and on the ROCm stack it was — it lifts a cap in the ROCm SVM allocation path. llama.cpp's Vulkan backend allocates through RADV/GTT and never hits that cap. One stack swap, one fewer kernel parameter.

What older guides say vs. what this build measured

Guides say	Measured on 26.04 + kernel 7.0
Set `amdgpu.gttsize=N` in GRUB	Deprecated; the kernel ignores it. Use `ttm.pages_limit`.
`amdgpu.no_system_mem_limit=1` is required for big models	Only on the ROCm path. Not needed under Vulkan/RADV.
You need ROCm for serious inference on AMD	llama.cpp + Vulkan beat the bundled-ROCm Ollama stack on this box: ~48 vs ~35 tok/s, and no corruption failure mode.
Build mesa-git or use a PPA for the iGPU	Stock Mesa 26.0.3 in Ubuntu 26.04 already labels the device `RADV STRIX_HALO`.
128K context “just works” if it fits in memory	It allocates fine; using it crashes the GPU past ~80k prefill tokens unless the micro-batch is small. See the deep-dive.
Newer kernels remove the need for GRUB tuning	Half-true: the kernel auto-detects GTT capacity but still does not raise the allocation cap. Half-true fails closed on big models.

The numbers

Metric	Value
Generation, short context	~48 tok/s
Generation, interactive multi-turn	~48 tok/s
Prefill at 16k / 32k tokens	722 / 877 tok/s (`-ub 2048`, short-context config)
Prefill at ~88k / ~110k tokens	~164 / ~195 tok/s (`-ub 512`, the stable long-context config)
Memory with model resident at 128K	~65 GiB of 89 GiB
Cold model load	~36 s
NIAH retrieval, ~110k tokens, 10/50/90% depth	3/3 exact

Going from 32K to the full 128K window costs only ~3 GiB of memory — gpt-oss's GQA + sliding-window attention make the KV cache cheap, so the native window is not memory-bound on this box. What it isbound by, it turns out, is the GPU watchdog — the crash we hit at ~88,000 tokens, why flash attention didn't fix it, and the one flag that did is its own post.

Will it replace Claude Code?

No.

Hosted frontier models are faster on short outputs and stronger on multi-step coding tasks. Use the right tool for the job.

What this box is for is agents: recurring, multi-step, frequently privacy-bound workflows. Data that has to stay in the building. High-volume, repetitive workloads you would rather run for a month on flat-rate electricity than meter by the token. And — since this is what we built the box for — a testbed for whether a local 120B model can actually drive real agent loops: tool-calling, long-context document work, multi-turn state. Those results are the next posts in this series.

Why we built this box

We built this box to test AGLedger, the cryptographic notary we make for automated work. An agent running on hardware you own, on a model you serve yourself, has no vendor logs and no provider audit trail — if you want a tamper-evident record of what it intended and what it did, you have to make one. That is what AGLedger does, and it fits the same constraints as everything else here: bundled PostgreSQL, no external services, runs offline, no phone home. AGLedger Developer Edition is free and fully unlocked: start here. Cloud-free AI deserves cloud-free accountability.