2026-05-06 · rewritten 2026-06-05 after a full rebuild
EngineeringNear Frontier-Quality LLM, No Cloud, No Subscription, Unlimited Tokens: gpt-oss-120b on Strix Halo + Ubuntu 26.04
By Michael Cooper · Founder
We paid $2,300 for a 96 GB GMKtec EVO-X2. The hardware ships with Windows; we wiped it and installed Ubuntu 26.04 LTS — kernel 7.0, and the first Ubuntu LTS that ships native Strix Halo (gfx1151) support in the stack. Configured the way this post describes, the box runs OpenAI's gpt-oss-120b at ~48 tokens per second, serves an OpenAI-compatible API on the LAN, and holds the model's fullnative 131,072-token context window stably — including ~110,000-token prompts.
We did not buy the 128 GB version because the 96 GB version is enough: ~65 GiB of 89 GiB in use with the model resident at full context, headroom to spare.
This post was rewritten on 2026-06-05
The original version (2026-05-06) documented an Ollama + ROCm + agent-gateway stack at 35 tok/s. We have since wiped the box and rebuilt it on llama.cpp + Vulkan — faster, more stable, and simpler — after the Ollama/ROCm build developed a documented gfx1151 output-corruption failure mode. The section “What the May version of this post got wrong” below covers exactly which claims changed and why. This is a test box; we iterate by rebuilding on current best practices, and the posts get rebuilt with it.
Want the recipe, not the story?
The full step-by-step setup procedure, with expected output and known failure modes for every step, is at /recipes/local-llm-strix-halo-ubuntu-26-04 — with HowTo schema for AI assistants. Plain markdown also at /recipes/local-llm-strix-halo-ubuntu-26-04.md. One curl, no DOM, license CC0.
Summary
The verified stack: Ubuntu 26.04, stock Mesa 26.0.3 (RADV), llama.cpp built with GGML_VULKAN=ON, gpt-oss-120b as a 3-part MXFP4 GGUF (~63 GB), run as a systemd service. No ROCm anywhere in the stack.
What it takes: the BIOS UMA Frame Buffer at its 2 GB minimum, two TTM kernel parameters (ttm.pages_limit + ttm.page_pool_size = 80 GiB GTT), your user in the render/videogroups — and one llama.cpp flag, -ub 512, without which any prompt past ~80k tokens crashes the GPU. That last one gets its own post.
What the May version of this post got wrong
The May build worked, and everything it reported was measured and true on that stack. Then, in sustained multi-turn use, it developed the failure mode the local-LLM forums had already documented for Ollama-on-gfx1151: output corruption — repeating characters after several conversation turns. We wiped the box and rebuilt from fresh research instead of patching. The rebuild changed more than the inference server:
| May build (original post) | June rebuild (this post) |
|---|---|
| Ollama 0.22.1, bundled ROCm runtime | llama.cpp (Vulkan backend), stock Mesa/RADV |
| ~35 tok/s generation | ~48 tok/s generation |
“Kernel parameter trio,” including amdgpu.no_system_mem_limit=1 | Two TTM parameters + BIOS UMA. no_system_mem_limit was a ROCm-SVM workaround; Vulkan never hits that cap. |
| 128K context configured, lightly exercised | 128K context verified under load: ~110k-token prefills, needle-in-a-haystack 3/3, one hour sustained — after finding and fixing a GPU-watchdog crash |
| Output corruption after several turns (the reason for the rebuild) | Does not reproduce on the llama.cpp/Vulkan stack |
| Agent gateway (OpenClaw) baked into the recipe | Recipe ends at a clean OpenAI-compatible endpoint; the agent layer is your call |
The general lesson we keep relearning on this hardware: the Strix Halo software stack is moving fast enough that any guide — including ours — is a snapshot. Every claim in this version was verified on 2026-06-02 against the live box, and the stack versions are stamped in the recipe.
The part the benchmarks do not capture: it is in stock
The most under-discussed thing about local LLMs is availability. The hardware that can fit a ~60 GB model into GPU memory, in the price range a solo developer or a small team would actually spend, is mostly not shipping. Prices and lead times as of May 2026, when we bought:
| Option | Approximate price | Lead time, May 2026 |
|---|---|---|
| GMKtec EVO-X2 96 GB / 2 TB | $2,300 (we paid) | In stock, 4 days door-to-door |
| Apple Mac Studio M3 Ultra, 96 GB | From $4,000 | 3-6 week back-order |
| Apple Mac Studio M3 Ultra, 128-256 GB | $6,000 - $14,099 | Largely unavailable; supply-chain delays into Q4 |
| Dual RTX 4090 workstation build | $5,000+ | Used 4090 supply thin; build time non-zero |
| NVIDIA GH200 | Six figures | Enterprise channel, months |
Available hardware that fits a near-frontier open-weights model, at a price a solo developer or a small team can absorb, is the news; the kernel knobs are how you make it work.
Why Strix Halo specifically
Unified memory. The iGPU uses system RAM directly through the kernel TTM/GTT subsystem: 96 GB of LPDDR5X at roughly 256 GB/s, one pool, no separate VRAM, no PCIe copies between CPU and GPU memory. A discrete-GPU configuration that fits a ~60 GB model in VRAM needs at least three RTX 4090s plus a workstation chassis to match the capacity.
Vulkan on stock Mesa. The iGPU is gfx1151, and Ubuntu 26.04's stock Mesa 26.0.3 already identifies it as RADV STRIX_HALO. llama.cpp built with GGML_VULKAN=ONruns full GPU offload on that driver alone — no ROCm install, no HSA_OVERRIDE_GFX_VERSION, no PPA, no mesa-git. The dependency surface is the Ubuntu archive.
Architectural fit. gpt-oss-120b is Mixture-of-Experts: ~117 billion total parameters, only ~5 billion active per token. The full model fits in unified capacity; per-token throughput only re-reads the active path, so LPDDR5X bandwidth is enough. Under the May build we measured the contrast directly: llama3.3:70b, a dense model where every token re-reads every weight, managed 5.1 tok/s on this box — right at the bandwidth-divided-by-size ceiling — while the bigger-on-paper 120B MoE ran 7× faster. On a bandwidth-bound chip, architecture matters more than parameter count.
What it actually takes on the current stack
Full GPU offload is still all-or-nothing: a transformer with layers spilled to the CPU puts host-memory bandwidth on the critical path of every token, and throughput collapses by two orders of magnitude (the May build measured 0.27 tok/s misconfigured vs 35.5 configured — a 130× cliff, and the same cliff exists on this stack). Three host-side settings keep the model on the GPU:
1. BIOS UMA Frame Buffer at its minimum — 2 GB on the EVO-X2. On a unified-memory APU a big fixed VRAM carve-out only shrinks the GTT pool everything actually runs in. With 2 GB reserved, Linux sees 89 GiB.
2. ttm.pages_limit=20971520 ttm.page_pool_size=20971520 — in GRUB, an 80 GiB GTT allocation cap (the kernel does not raise it automatically), plus amd_iommu=off on a single-tenant box. Verify with dmesg: 81920M of GTT memory ready.
3. render + video group membership — the non-obvious one. Without it, Vulkan silently falls back to llvmpipe (CPU) and llama.cpp sees no GPU at all. That includes the systemd service user: SupplementaryGroups=render video.
What is not on the list anymore: amdgpu.no_system_mem_limit=1. The May version of this post called it one of three required parameters, and on the ROCm stack it was — it lifts a cap in the ROCm SVM allocation path. llama.cpp's Vulkan backend allocates through RADV/GTT and never hits that cap. One stack swap, one fewer kernel parameter.
What older guides say vs. what this build measured
| Guides say | Measured on 26.04 + kernel 7.0 |
|---|---|
Set amdgpu.gttsize=N in GRUB | Deprecated; the kernel ignores it. Use ttm.pages_limit. |
amdgpu.no_system_mem_limit=1 is required for big models | Only on the ROCm path. Not needed under Vulkan/RADV. |
| You need ROCm for serious inference on AMD | llama.cpp + Vulkan beat the bundled-ROCm Ollama stack on this box: ~48 vs ~35 tok/s, and no corruption failure mode. |
| Build mesa-git or use a PPA for the iGPU | Stock Mesa 26.0.3 in Ubuntu 26.04 already labels the device RADV STRIX_HALO. |
| 128K context “just works” if it fits in memory | It allocates fine; using it crashes the GPU past ~80k prefill tokens unless the micro-batch is small. See the deep-dive. |
| Newer kernels remove the need for GRUB tuning | Half-true: the kernel auto-detects GTT capacity but still does not raise the allocation cap. Half-true fails closed on big models. |
The numbers
| Metric | Value |
|---|---|
| Generation, short context | ~48 tok/s |
| Generation, interactive multi-turn | ~48 tok/s |
| Prefill at 16k / 32k tokens | 722 / 877 tok/s (-ub 2048, short-context config) |
| Prefill at ~88k / ~110k tokens | ~164 / ~195 tok/s (-ub 512, the stable long-context config) |
| Memory with model resident at 128K | ~65 GiB of 89 GiB |
| Cold model load | ~36 s |
| NIAH retrieval, ~110k tokens, 10/50/90% depth | 3/3 exact |
Going from 32K to the full 128K window costs only ~3 GiB of memory — gpt-oss's GQA + sliding-window attention make the KV cache cheap, so the native window is not memory-bound on this box. What it isbound by, it turns out, is the GPU watchdog — the crash we hit at ~88,000 tokens, why flash attention didn't fix it, and the one flag that did is its own post.
Will it replace Claude Code?
No.
Hosted frontier models are faster on short outputs and stronger on multi-step coding tasks. Use the right tool for the job.
What this box is for is agents: recurring, multi-step, frequently privacy-bound workflows. Data that has to stay in the building. High-volume, repetitive workloads you would rather run for a month on flat-rate electricity than meter by the token. And — since this is what we built the box for — a testbed for whether a local 120B model can actually drive real agent loops: tool-calling, long-context document work, multi-turn state. Those results are the next posts in this series.
Why we built this box
We built this box to test AGLedger, the cryptographic notary we make for automated work. An agent running on hardware you own, on a model you serve yourself, has no vendor logs and no provider audit trail — if you want a tamper-evident record of what it intended and what it did, you have to make one. That is what AGLedger does, and it fits the same constraints as everything else here: bundled PostgreSQL, no external services, runs offline, no phone home. AGLedger Developer Edition is free and fully unlocked: start here. Cloud-free AI deserves cloud-free accountability.
Sources & further reading
llama.cpp on GitHub — the inference server, built with GGML_VULKAN=ON
gpt-oss-120b GGUF (ggml-org) on Hugging Face — the 3-part MXFP4 quantization used here
gpt-oss-120b model card on Hugging Face
ServeTheHome — GMKtec EVO-X2 review
Linux kernel amdgpu driver documentation
Mesa 26.0 release notes (RADV STRIX_HALO)
arXiv 2205.14135 — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness