2026-05-06
Engineering
Near Frontier-Quality LLM, No Cloud, No Subscription, Unlimited Tokens: gpt-oss:120b on Strix Halo + Ubuntu 26.04
By Michael Cooper · Founder
We paid $2,300 for a 96 GB GMKtec EVO-X2. The hardware ships with Windows; we wiped it and installed Ubuntu 26.04 LTS — released last month, kernel 7.0, and the first Ubuntu LTS that ships with native Strix Halo (gfx1151) support in the stack. Once configured, the box runs OpenAI's gpt-oss:120b at roughly 35 tokens per second, draws 40 W at idle and 100-140 W under sustained inference, and answers questions from idle in ~3.85 seconds end-to-end through the OpenClaw gateway, or ~1.2 seconds direct to Ollama.
We did not buy the 128 GB version because the 96 GB version was enough — gpt-oss:120b at a 128K-token context window fits in about 68 GB, which leaves roughly 24 GB for the operating system, the agent gateway, and tool execution. As far as we can tell, this is the first published walk-through of gpt-oss:120b on this hardware against the new Ubuntu LTS.
The rest of this post is how we got it there.
Want the recipe, not the story?
The full step-by-step setup procedure, with expected output and known failure modes for every step, is at /recipes/local-llm-strix-halo-ubuntu-26-04 — with HowTo schema for AI assistants. Plain markdown also at /recipes/local-llm-strix-halo-ubuntu-26-04.md — one curl, no DOM, license CC0.
Summary
A 96 GB Strix Halo box, ordered on a Wednesday and delivered that Sunday, runs gpt-oss:120b at 35 tok/s sustained on Ubuntu 26.04 with kernel 7.0 once three kernel-and-BIOS settings line up. Before those settings: 0.27 tok/s. After: 35.5 tok/s. Same model, same prompt, same hardware. A 130× speedup just from letting the GPU allocate a large enough slice of unified memory.
The three settings that have to land together on this chip: ttm.pages_limit=23068672, amdgpu.no_system_mem_limit=1, and the BIOS UMA Frame Buffer reduced to its minimum. Set two of them, leave the third off, and gpt-oss:120b fails to load, falls into partial CPU offload, or thrashes hard enough to make the box unresponsive.
The part the benchmarks do not capture: it is in stock
The most under-discussed thing about local LLMs in May 2026 is availability. The hardware that can fit a 65 GB model into GPU memory, in the price range a solo developer or a small team would actually spend, is mostly not shipping right now.
| Option | Approximate price | Lead time, May 2026 |
|---|---|---|
| GMKtec EVO-X2 96 GB / 2 TB | $2,300 (we paid) | In stock, 4 days door-to-door |
| Apple Mac Studio M3 Ultra, 96 GB | From $4,000 | 3-6 week back-order |
| Apple Mac Studio M3 Ultra, 128-256 GB | $6,000 - $14,099 | Largely unavailable; supply-chain delays into Q4 |
| Dual RTX 4090 workstation build | $5,000+ | Used 4090 supply thin; build time non-zero |
| NVIDIA GH200 / DGX Spark | Six figures | Enterprise channel, months |
We paid $2,300 for the GMKtec EVO-X2 at the 96 GB / 2 TB configuration. It arrived four days after the order. That is the part that made the rest of it worth writing up. Available hardware that fits a near-frontier open-weights model, at a price a solo developer or a small team can absorb, is the news; the kernel knobs are how you make it work.
Why Strix Halo specifically
Strix Halo is what makes this work. Two things matter, and then a third architectural fit.
Unified memory. The iGPU uses system RAM directly through the kernel TTM/GTT subsystem — 96 GB of LPDDR5X at roughly 256 GB/s, one pool, no separate VRAM, no PCIe copies between CPU and GPU memory. A discrete-GPU configuration that fits a 65 GB model in VRAM needs at least three RTX 4090s plus a workstation chassis to match the capacity.
ROCm on gfx1151. The iGPU is gfx1151, and Ollama 0.22.1 ships bundled ROCm runtime libraries with gfx1151 kernels. The runtime selects ROCm automatically on this chip. No HSA_OVERRIDE_GFX_VERSION, no Vulkan fallback, no third-party kernel patches.
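If you want to confirm the chip really does enumerate as gfx1151 before trusting the automatic selection, here is an optional sanity check. It assumes Ubuntu's rocminfo package; Ollama itself does not need it installed, since it bundles its own ROCm runtime.

```
# Optional sanity check: the iGPU should enumerate as gfx1151
sudo apt install rocminfo
rocminfo | grep -i gfx
# Expect the GPU agent's Name and ISA string to contain gfx1151
```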
Architectural fit. gpt-oss:120b is Mixture-of-Experts: 120 billion total parameters, only ~5 billion active per token. The full 120B fits in unified capacity. Per-token throughput only re-reads the ~5B active path, so LPDDR5X bandwidth is actually enough — you do not need HBM to keep this model fed. A 65 GB dense model on identical hardware would run at roughly the LPDDR5X ceiling divided by model size: ~4 tok/s. The 65 GB MoE runs at ~35 tok/s. Capacity for the chip, bandwidth for the architecture.
The before/after that defines the post
Before we got the kernel parameters right, gpt-oss:120b ran at 0.27 tokens per second on this box. After: 35.5 tokens per second. Same model, same prompt, same hardware.
That is a 130× speedup with zero hardware change. It is entirely about whether the GPU gets to put the entire model in its own memory or has to spill 11 percent of the layers to the CPU. Partial CPU offload on a transformer is fatal, not graceful: the host-memory bandwidth bottleneck sits on the critical path for every generated token. If you see ollama ps showing anything other than 100% GPU, treat it as a configuration failure, not a tradeoff.
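Checking takes one command; the expected values below are illustrative, and the column that matters is PROCESSOR:

```
# After one prompt has loaded the model, check how Ollama scheduled it
ollama ps
# The PROCESSOR column should read "100% GPU".
# A split such as "11%/89% CPU/GPU" means partial offload: revisit the trio below.
```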
The kernel-parameter trio
The fix is three changes that have to happen together. They are independent: each addresses a different constraint, and any one of them missing is enough to break gpt-oss:120b. Our first pass on this Ubuntu 26.04 install covered only the first knob, mostly because that is what every older guide on the internet says.
1. ttm.pages_limit=23068672 — in /etc/default/grub on the GRUB_CMDLINE_LINUX_DEFAULT line (the full edited line is sketched after this list). 23,068,672 pages × 4 KB = 88 GB of GPU-allocatable memory. The kernel auto-detects how much GTT exists; it does not auto-bump the allocation cap past the BIOS VRAM slice.
2. amdgpu.no_system_mem_limit=1 — same line, removes a separate AMDGPU-level cap on how much system RAM the iGPU can pin. Without it, single ROCm allocations larger than ~32 GB fail with SVM mapping failed, exceeds resident system memory limit even with TTM raised.
3. BIOS UMA Frame Buffer reduced to its minimum — on the EVO-X2 that is 2 GB. The default 32 GB reserves physical RAM as “dedicated VRAM,” invisible to Linux as system memory and unable to participate in GTT for big single buffers. On a unified-memory APU there is no physical reason to reserve a chunk; the kernel can use system RAM as GPU memory through the GTT pool just fine.
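For reference, a sketch of what the edited line in /etc/default/grub ends up looking like once the parameters from items 1 and 2 are appended; your existing defaults, such as quiet splash, may differ:

```
# /etc/default/grub: append both parameters to whatever is already on this line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash ttm.pages_limit=23068672 amdgpu.no_system_mem_limit=1"

# Then regenerate the boot config
sudo update-grub
```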
After update-grub and a reboot, confirm the kernel saw both parameters:
```
$ cat /proc/cmdline | tr ' ' '\n' | grep -E 'ttm|amdgpu'
ttm.pages_limit=23068672
amdgpu.no_system_mem_limit=1

$ cat /sys/module/ttm/parameters/pages_limit
23068672

$ dmesg | grep 'amdgpu .* of GTT'
amdgpu 0000:c4:00.0: 90112M of GTT memory ready.
```
What older guides say vs. what 26.04 actually does
Most of the existing corpus on Strix Halo + Linux LLMs was written between mid-2025 and early 2026 against Ubuntu 24.04 or 25.04 with kernel 6.11 - 6.16. Several pieces of that advice produce different results on a fresh 26.04 install.
| Older guides say | 26.04 + kernel 7.0 reality |
|---|---|
| Set amdgpu.gttsize=N in GRUB | Deprecated; the kernel logs a warning and otherwise ignores it. |
| TTM is the only memory cap that matters | No. amdgpu.no_system_mem_limit=1 is a separate cap. Both are required for any single allocation past about 32 GB. |
| Leave the BIOS UMA Frame Buffer at default | Reduce it to the BIOS minimum. Every gigabyte it reserves is one Linux cannot use as GTT. |
| ROCm doesn't support gfx1151, use Vulkan | ROCm works. Ollama 0.22.1 ships bundled ROCm libraries with gfx1151 kernels and selects ROCm automatically. |
| Build mesa-git or use a PPA for the iGPU | Stock Mesa 26.0.3 in Ubuntu 26.04 already labels the device RADV STRIX_HALO. |
| Kernel 6.16.9+ removes the need for any GRUB tuning | Half-true. The kernel auto-detects exposure (Ollama reported 62 GiB available out of the box) but not allocation: TTM still defaulted to the BIOS VRAM slice. Half-true fails closed on big models. |
Benchmarks across four models on the same box
Same hardware, same Ollama 0.22.1, same prompts, same warm state. We ran a code-generation prompt and an open-ended reasoning prompt against four models:
| Model | Resident | Code task | Reasoning | Architecture |
|---|---|---|---|---|
| gemma4 | 10 GB | 54.3 tok/s | 52.0 tok/s | Dense |
| gpt-oss:20b | 13 GB | 48.7 tok/s | 47.4 tok/s | MoE, reasoning, ~4B active |
| gpt-oss:120b | 65 GB | 35.5 tok/s | 34.9 tok/s | MoE, ~5B active |
| llama3.3:70b | 57 GB | 5.1 tok/s | 5.1 tok/s | Dense |
The instructive comparison is the bottom two rows. The 70B is a dense model: every token re-reads every weight, so throughput is bounded by memory bandwidth divided by the weights each token has to stream. On this box, 256 GB/s ÷ 39 GB of weights ≈ 6.5 tok/s is the ceiling; the 57 GB resident figure in the table includes KV cache and compute buffers on top of the weights. We measured 5.1, about 78 percent of the ceiling.
The 120B is a Mixture-of-Experts model: 120 billion parameters total, only ~5 billion active per token. A token only re-reads a slice of the weights. Same bandwidth, smaller working set, much faster output. Architecture matters more than parameter count on a bandwidth-bound chip. It is why the 65 GB model runs 7× faster than the 39 GB one on identical hardware.
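The same arithmetic, written out. The ~3 GB active-path figure for the MoE is our rough assumption (about 5 billion active parameters in MXFP4), not a measured number:

```
# Rough throughput ceilings: bandwidth (GB/s) divided by GB re-read per token
awk 'BEGIN {
  bw = 256                               # LPDDR5X bandwidth, GB/s
  printf "dense 70B, ~39 GB weights : %4.1f tok/s ceiling (measured 5.1)\n",  bw / 39
  printf "dense 65 GB, hypothetical : %4.1f tok/s ceiling\n",                 bw / 65
  printf "MoE, ~3 GB active path    : %4.0f tok/s ceiling (measured 35.5)\n", bw / 3
}'
```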
Cold response, sustained throughput, power
With OLLAMA_KEEP_ALIVE=30m, the model stays resident in unified memory for thirty minutes after the last request. From idle, end-to-end latency to a one-word reply through the OpenClaw gateway is ~3.85 seconds. A direct Ollama probe at the same moment runs in ~1.2 seconds — the difference is the OpenClaw tax: agent orchestration, channel routing, memory and skill resolution, and policy checks before the prompt hits the model. Subsequent queries are faster. The one-time cold load from NVMe into unified memory takes about 24 seconds; that is paid once per process lifetime, not once per request.
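What we mean by a direct Ollama probe, roughly: a request straight at the /api/generate endpoint, bypassing the gateway. The prompt and the per-request keep_alive value here are illustrative.

```
# Warm-path latency probe straight at Ollama, bypassing the gateway
time curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "gpt-oss:120b",
  "prompt": "Reply with exactly one word: ready",
  "stream": false,
  "keep_alive": "30m"
}'
```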
Idle wall power is around 40 W. Sustained inference draws 100 - 140 W: the GPU PPT peaks near 138 W during prompt-eval and drops slightly during autoregressive generation. Across a 50-minute sustained-load test the GPU edge temperature peaked at 83 °C and CPU Tctl peaked at 87.5 °C, both with comfortable margin to the silicon trip threshold. Average draw across a month of mixed-use sits closer to 60 W (mostly idle, with bursts) — roughly 40 kWh, or about $6 at the U.S. residential average of $0.16/kWh.
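For anyone repeating the thermal test: we watched the sensors with the stock lm-sensors tooling. A sketch; the exact labels (edge, Tctl, and a PPT power reading where the driver exposes one) depend on how amdgpu and k10temp enumerate on your board, and wall power still needs a meter at the plug.

```
# Temperatures and (where exposed) package power, refreshed every 2 s
sudo apt install lm-sensors
watch -n 2 sensors
# amdgpu-*  : "edge" GPU temperature, "PPT" power on boards that expose it
# k10temp-* : CPU "Tctl"
```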
Memory hierarchy at peak load
```
96 GB Unified LPDDR5X (~256 GB/s, one chip — the only RAM there is)
│
├── BIOS UMA Frame Buffer: 2 GB (dedicated; was 32 GB default)
│
└── Linux system RAM: 89 GB (everything else, including GTT pool)
    │
    │ ┌─ Constraints (all three required for >32 GB single allocations) ─┐
    │ │ ttm.pages_limit = 88 GB (kernel TTM allocation cap)              │
    │ │ amdgpu.no_system_mem_limit = 1 (removes secondary cap)           │
    │ │ BIOS UMA Frame Buffer reduced (frees ~28 GB to system)           │
    │ └──────────────────────────────────────────────────────────────────┘
    │
    ├── GTT Resident: gpt-oss:120b ~65 GB, 100 % GPU offload
    │     ├─ 60 GB MXFP4 weights
    │     ├─ 3 GB compute graph
    │     └─ 2 GB Q8_0 KV cache @ 128K context
    │
    └── Headroom: ~24 GB (kernel, Ollama, OpenClaw, OS, free)
```

The 96 GB configuration leaves us about 24 GB after the model is resident at full 128K context. We could fit a second model alongside — gemma4 at 10 GB sits inside that headroom comfortably — but we chose not to. OpenClaw's model-prewarm sidecar will load every configured fallback at gateway startup, not just on actual failover, and we wanted the headroom for tool execution and the OS rather than a fallback that almost never fires.
The agent gateway: OpenClaw
We wired the box to OpenClaw, Peter Steinberger's local-first AI assistant that bridges messaging channels (Telegram, Slack, Discord, iMessage, WhatsApp, Signal, and a dozen others) to whatever LLM you point it at. We picked it because it was the most-starred local-agent project on GitHub when we set this box up: 247,000 stars by March 2026, with 60,000 of those landing in the first 72 hours after release. Steinberger joined OpenAI in February 2026 and OpenClaw moved to a foundation, which means the project is sticking around. NVIDIA published a NemoClaw integration guide. Ollama added an official integrations page. If you want to talk to a local LLM the way you talk to a hosted assistant, this is currently the path of least resistance.
Two configuration footguns on a Strix Halo box specifically:
Provider timeout — the default per-provider request timeout is around 140 seconds, which is too short for a reasoning model on a 35 tok/s GPU. Set models.providers.ollama.timeoutSeconds to 600. Leave it at the default and gpt-oss:120b will trip the timeout on long prompts and silently fail over to your fallback model.
Fallback prewarming — model-prewarm loads every fallback at gateway startup, not just on actual failover. On an 88 GB-cap iGPU this can eat headroom you tuned the rest of the system to keep. Either remove the fallback list (openclaw config unset agents.defaults.model.fallbacks) or set per-model params.keep_alive=0 so a fallback loads when needed and unloads immediately afterward.
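Applied to our config, the two fixes look roughly like this. Note that openclaw config set is our assumed counterpart to the unset subcommand above; check your install's CLI help for the exact syntax.

```
# Footgun 1: give a 35 tok/s reasoning model time to finish long prompts
openclaw config set models.providers.ollama.timeoutSeconds 600   # "set" is assumed; see note above

# Footgun 2: keep the prewarm sidecar from loading fallbacks into your headroom
openclaw config unset agents.defaults.model.fallbacks
```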
Will it replace Claude Code?
No.
Claude Sonnet 4.6 is faster on short outputs and stronger on multi-step coding tasks. Use the right tool for the job.
What this box is for is agents: recurring, multi-step, frequently privacy-bound workflows.
On the consumer end of the spectrum, that means orchestrating your kid's wedding — the vendor list, the date juggling, the RSVPs, the contract reviews, the day-of run-of-show. Multi-month, multi-stakeholder, the kind of project an agent can drive end-to-end.
On the regulated end, it means a hospital's patient check-in queue. Data that has to stay in the building. High-volume, repetitive, identity-sensitive. The kind of workload where an on-prem near-frontier model and a hosted tiny model are no longer in the same conversation.
In between: calendar and meeting management. Document triage. Email handling. Quote and invoice processing. Anything multi-step and recurring; anything you would rather run for a month on flat-rate electricity than meter by the token. And, since this is what we built the box for, the testbed for software that has to work on customer hardware that may never reach the public internet.
The chip in this box is a 2026-shipping APU running a model OpenAI did not release until 2025. By 2027 there will be a wider iGPU, a faster MoE, and the gap to hosted will narrow again.
The model describing its own deployment
We asked gpt-oss:120b, while it was running on this box, to write a paragraph introducing itself. We told it the hardware and software stack it was running on but no other details. What it produced, unedited:
I am gpt-oss:120b, a Mixture-of-Experts language model with a total of 120 billion parameters, of which roughly 5 billion are active for each token. I run locally on a GMKtec EVO-X2 mini-PC equipped with an AMD Ryzen AI MAX+ 395 (“Strix Halo”) APU and its integrated Radeon 8060S (gfx1151) GPU. The system provides 96 GB of unified LPDDR5X memory at about 256 GB/s; after a BIOS tweak only 2 GB is reserved as fixed VRAM, while the remaining 89 GB is exposed to Linux and pinned dynamically for GPU access through the kernel TTM/GTT subsystem. My weights are stored in MXFP4-quantized form, occupying roughly 65 GB of that unified pool. Inference is served by Ollama 0.22.1 with the bundled ROCm runtime targeting gfx1151, using Flash Attention and a Q8_0 KV cache to sustain a 128 K token context window. I generate roughly 35 tokens per second and receive requests via OpenClaw, which forwards Telegram direct messages from a single allow-listed user. I can report the static configuration of CPU, GPU, memory layout, and software stack, but I have no visibility into real-time sensor data such as chip temperature, current power draw, network latency, or which other models may be resident in the same runtime environment. My uptime is tracked by the host system.
Two things worth noting about that paragraph. One: it is accurate. We checked every claim against the hardware and the running configuration. Two: the model has no live access to its own runtime state — the static configuration is available because we put it in the prompt, but real-time sensor data isn't reachable from the request path. The model knows that, and says so explicitly. That is the kind of self-grounding that is easy to lose in a hosted product but relatively easy to keep in a local one.
Run the benchmark on your own box
We ship a small stdlib-only Python script that runs the same two prompts against any Ollama host and prints a Markdown results table. No dependencies, no virtualenv, no API keys. Download bench.py.
```
# On the host running Ollama:
curl -O https://agledger.ai/blog-assets/bench.py
python3 bench.py --host http://127.0.0.1:11434 --models gpt-oss:120b,gemma4

# Or from another machine on the LAN:
python3 bench.py --host http://192.168.x.y:11434 --models gpt-oss:120b
```
If you have a Strix Halo box, an M-series Mac, a 4090 rig, or anything else that runs Ollama, post your numbers somewhere and link back. We will collect a community-numbers table on this post.
Why we built this box
We built this box to test AGLedger, the cryptographic notary we make for automated work. The chain that records what every agent on this box says it is about to do, and what it says it did, has to fit alongside everything else here — bundled PostgreSQL, no external services, runs offline, no phone home. AGLedger Developer Edition is free and fully unlocked; if you are running local agents and want a tamper-evident record of every turn they take, that is where to start. Cloud-free AI deserves cloud-free accountability. We will publish the OpenClaw wiring as a follow-up post.
Sources & further reading
OpenClaw on GitHub — the local-first agent gateway
Ollama OpenClaw integration docs
gpt-oss:120b model card on Hugging Face
ServeTheHome — GMKtec EVO-X2 review
Linux kernel amdgpu driver documentation
Mesa 26.0 release notes (RADV STRIX_HALO)
llama.cpp discussions on Strix Halo
arXiv 2205.14135 — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness