Recipe · verified 2026-06-02
Engineering recipe
gpt-oss-120b on Strix Halo + Ubuntu 26.04 (llama.cpp + Vulkan)
Agent-readable version
Plain markdown at /recipes/local-llm-strix-halo-ubuntu-26-04.md. Fetch with one curl, ingest with any LLM tool, license CC0.
curl -O https://agledger.ai/recipes/local-llm-strix-halo-ubuntu-26-04.md
Revision note — 2026-06-02 rebuild
An earlier version of this recipe (verified 2026-05-06) used Ollama with its bundled ROCm runtime plus an agent gateway. That stack worked at short context but produced a documented output-corruption failure mode on gfx1151 (repeating characters after several conversation turns) and has been retired. The current recipe builds llama.cpp directly with the Vulkan backend on stock Mesa/RADV: faster (~48 vs ~35 tok/s generation), more stable, and one fewer kernel parameter — amdgpu.no_system_mem_limit=1 was a ROCm-path requirement and is not needed on Vulkan.
What this recipe produces
- A 96 GB Strix Halo box (or any AMD Ryzen AI MAX+ 395 platform with comparable BIOS access)
- gpt-oss-120b (MXFP4 GGUF) at full GPU offload via llama.cpp + Vulkan/RADV
- The full native 131,072-token context window, stable: ~110k-token prefills with zero GPU resets (requires -ub 512; see Step 7)
- ~48 tok/s generation in interactive multi-turn use
- An OpenAI-compatible HTTP API on the LAN at port 8080, run as a systemd service that survives reboots
- ~65 GiB of 89 GiB system memory in use with the model resident at full context
Prerequisites
Hardware
- AMD Ryzen AI MAX+ 395 (“Strix Halo”) APU with Radeon 8060S iGPU (gfx1151)
- 96 GB or 128 GB unified LPDDR5X memory (96 GB is sufficient for gpt-oss-120b at full 128K context)
- ~80 GB free on NVMe for the model weights (3-part GGUF, ~63 GB)
- LAN connectivity
Software
- Ubuntu 26.04 LTS, fresh install with kernel 7.0+
- sudo access
- No ROCm install, no PPA, no mesa-git: stock Mesa 26.0.3 already identifies the device as RADV STRIX_HALO
Software versions verified
Ubuntu 26.04 LTS (Resolute Raccoon)
kernel 7.0.0-22-generic
Mesa / RADV 26.0.3-1ubuntu1
llama.cpp build 4fb16ec (GGML_VULKAN=ON, Release)
gpt-oss-120b MXFP4 GGUF (ggml-org), 3 parts, ~63 GB
Step 1 — BIOS: reduce UMA Frame Buffer to minimum
At boot, enter the BIOS setup utility (typically Delete or F2 on the EVO-X2). Find the UMA Frame Buffer Size setting (under Memory or Advanced/AMD CBS depending on platform).
Set to the BIOS minimum. On the GMKtec EVO-X2, that minimum is 2 GB. Save and reboot.
Verify after boot:
sudo dmesg | grep "of VRAM memory ready" # Expect: amdgpu 0000:c5:00.0: 2048M of VRAM memory ready
On a unified-memory APU, BIOS-reserved VRAM is invisible to Linux as system RAM. The iGPU reaches the model through the kernel's GTT pool (ordinary system RAM pinned for GPU use), so a big fixed carve-out only shrinks the pool everything actually runs in. With 2 GB reserved, the 96 GB box exposes 89 GiB to Linux.
Step 2 — Kernel command line: raise the GTT cap
sudo cp /etc/default/grub /etc/default/grub.bak.$(date +%s)
sudo sed -i 's|^GRUB_CMDLINE_LINUX_DEFAULT="[^"]*"|GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=off ttm.pages_limit=20971520 ttm.page_pool_size=20971520"|' /etc/default/grub
sudo update-grub
sudo systemctl reboot
20971520 pages × 4 KiB = 80 GiB of GPU-allocatable GTT. ttm.page_pool_size matches so the TTM page pool can back the full cap. amd_iommu=off removes IOMMU translation overhead on the unified-memory path.
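The page-count arithmetic can be reproduced for other target caps. This is a minimal sketch, assuming a 4 KiB page size (the usual x86_64 default; `getconf PAGESIZE` reports the real value). The function name `gib_to_pages` is hypothetical and not part of any tool.

```shell
# Convert a target GTT cap in GiB into a page count for ttm.pages_limit.
# Assumption: 4 KiB pages unless a page size in bytes is passed as $2.
gib_to_pages() {
  local gib page_bytes
  gib=$1
  page_bytes=${2:-4096}
  echo $(( gib * 1024 * 1024 * 1024 / page_bytes ))
}

gib_to_pages 80   # prints 20971520, the value used in this recipe
```

On a 128 GB box a larger cap may be reasonable, but only the 80 GiB value was verified here.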
Verify after reboot:
cat /proc/cmdline | tr ' ' '\n' | grep -E 'ttm|iommu'
# Expect:
# amd_iommu=off
# ttm.pages_limit=20971520
# ttm.page_pool_size=20971520
cat /sys/module/ttm/parameters/pages_limit
# Expect: 20971520
sudo dmesg | grep "of GTT memory ready"
# Expect: amdgpu 0000:c5:00.0: 81920M of GTT memory ready.
The kernel does not auto-raise the GTT allocation cap to match installed RAM — without ttm.pages_limit, large single allocations fail and the model cannot fully offload.
What you do NOT need on this stack: amdgpu.no_system_mem_limit=1. That parameter works around a cap in the ROCm SVM allocation path; llama.cpp's Vulkan backend allocates through RADV/GTT and never hits it. (amdgpu.gttsize= is deprecated and ignored on kernel 7.0; some older guides still recommend it.)
Step 3 — Add your user to the render and video groups
sudo usermod -aG render,video $USER # Log out and back in (or reboot) for membership to apply
Verify:
groups | tr ' ' '\n' | grep -E 'render|video'
# Expect both:
# render
# video
sudo apt install -y vulkan-tools
vulkaninfo --summary | grep deviceName
# Expect: deviceName = Radeon 8060S Graphics (RADV STRIX_HALO)
Without render/video membership, the user cannot open /dev/dri/renderD*. Vulkan then silently falls back to llvmpipe (CPU software rasterizer), and llama.cpp sees no usable GPU. This is the single most common way this build “works but is 50× too slow.”
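The silent llvmpipe fallback can be caught with a small check before building anything. The sketch below classifies `vulkaninfo --summary` output read from stdin. The function name `vulkan_device_kind` is hypothetical, and the match strings (`RADV`, `llvmpipe`) are assumptions based on the device names this recipe expects.

```shell
# Classify Vulkan summary text read from stdin.
# Prints "hardware" if a RADV device is listed, "software" if only
# llvmpipe is listed, and "unknown" otherwise.
vulkan_device_kind() {
  local summary
  summary=$(cat)
  if printf '%s\n' "$summary" | grep -q 'RADV'; then
    echo hardware
  elif printf '%s\n' "$summary" | grep -qi 'llvmpipe'; then
    echo software
  else
    echo unknown
  fi
}

# Demo with the deviceName line this recipe expects:
echo 'deviceName = Radeon 8060S Graphics (RADV STRIX_HALO)' | vulkan_device_kind   # prints hardware
```

On the real box, run it as `vulkaninfo --summary | vulkan_device_kind`. A result of `software` points back at group membership.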
Step 4 — Build llama.cpp with the Vulkan backend
sudo apt install -y build-essential cmake git libvulkan-dev glslc glslang-tools spirv-headers
cd ~
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Verify:
./build/bin/llama-server --version
# Expect a version/build line, built for Linux x86_64
./build/bin/llama-server --list-devices 2>&1 | grep -i vulkan
# Expect a Vulkan0 device naming the Radeon 8060S (RADV STRIX_HALO)
Distro/prebuilt llama.cpp binaries are usually CPU-only or CUDA. The Vulkan backend must be compiled in (GGML_VULKAN=ON), and it builds against stock Ubuntu 26.04 Vulkan SDK packages — no ROCm, no AMD driver download.
Step 5 — Download the model (3-part MXFP4 GGUF, ~63 GB)
sudo mkdir -p /data/models/gpt-oss-120b && sudo chown $USER: /data/models/gpt-oss-120b
pip install -U "huggingface_hub[cli]"
hf download ggml-org/gpt-oss-120b-GGUF \
  --include "*mxfp4*" --local-dir /data/models/gpt-oss-120b
Verify:
ls -l /data/models/gpt-oss-120b/
# Expect three files, ~63 GB total:
# gpt-oss-120b-mxfp4-00001-of-00003.gguf (~13 MB index part)
# gpt-oss-120b-mxfp4-00002-of-00003.gguf (~32 GB)
# gpt-oss-120b-mxfp4-00003-of-00003.gguf (~32 GB)
Put models on your largest/fastest NVMe (this recipe uses /data, a dedicated ext4 partition). llama.cpp is pointed at part 1 and finds the rest automatically.
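Because llama.cpp locates the remaining parts from part 1, an interrupted download can leave a directory that looks plausible but is incomplete. The sketch below checks that the number of parts on disk matches the total encoded in the filenames. The function name `gguf_parts_complete` is hypothetical, and it assumes the `<name>-0000N-of-0000M.gguf` naming shown in the listing above. It does not verify file sizes or checksums.

```shell
# Succeed only if every part of a split GGUF is present in the directory.
# Assumption: parts are named <name>-0000N-of-0000M.gguf.
gguf_parts_complete() {
  local dir first total expected found
  dir=$1
  first=$(ls "$dir"/*-00001-of-*.gguf 2>/dev/null | head -n 1)
  [ -n "$first" ] || return 1
  total=${first##*-of-}                        # e.g. "00003.gguf"
  total=${total%.gguf}                         # e.g. "00003"
  expected=$(echo "$total" | sed 's/^0*//')    # e.g. "3"
  found=$(ls "$dir"/*-of-"$total".gguf 2>/dev/null | wc -l | tr -d ' ')
  [ "$found" -eq "$expected" ]
}
```

Example: `gguf_parts_complete /data/models/gpt-oss-120b && echo "all parts present"`.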
Step 6 — Run llama-server as a systemd service
sudo tee /etc/systemd/system/llama-server.service <<'EOF'
[Unit]
Description=llama.cpp server (gpt-oss-120b, Vulkan)
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=120
StartLimitBurst=3

[Service]
Type=simple
User=youruser
Group=youruser
SupplementaryGroups=render video
ExecStart=/home/youruser/llama.cpp/build/bin/llama-server -m /data/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 999 -c 131072 --jinja -fa on -ub 512 -b 2048 --host 0.0.0.0 --port 8080
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server.service
(Replace youruser with your user.) Flag by flag:
- -ngl 999: offload all layers to the GPU.
- -c 131072: the model's full native context window. It fits because gpt-oss's GQA + sliding-window attention keep the KV cache small (~3 GiB more than at 32K).
- --jinja: use the model's own chat template (required for correct gpt-oss tool-calling and the harmony format).
- -fa on: flash attention. Sensible, but it does NOT by itself prevent the long-context GPU crash (Step 7).
- -ub 512: load-bearing for long-context stability. Do not raise it. See Step 7.
- SupplementaryGroups=render video: the service user needs GPU device access even under systemd.
- StartLimitBurst=3 over 120 s: repeated crashes stay failed and visible instead of silently restart-looping.
Verify (model load takes ~36 s):
sleep 40 && curl -sf http://127.0.0.1:8080/health
# Expect: {"status":"ok"}
journalctl -u llama-server -b --no-pager | grep -iE "vulkan|n_ctx" | head
# Expect a Vulkan device line naming RADV STRIX_HALO and n_ctx = 131072
Step 7 — The flag that keeps 128K from crashing the GPU: -ub 512
This is the part most guides do not cover, because it only bites past ~80,000 tokens of prefill. Each prefill compute submission covers -ub (micro-batch) tokens. At high context, attention over the large KV cache makes a single submission expensive enough to exceed the amdgpu compute-ring watchdog. The kernel resets the ring, the Vulkan device is lost, and llama-server dies mid-request:
amdgpu: ring comp_1.1.0 timeout, signaled seq=330115, emitted seq=330117
amdgpu: Starting comp_1.1.0 ring reset ... device wedged, but recovered through reset
llama-server: terminate called after throwing 'vk::DeviceLostError'
systemd: llama-server.service: Failed with result 'core-dump'
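To tell whether a box has hit this failure since boot, the kernel log can be scanned for the timeout line. This is a minimal sketch: the function name `count_ring_timeouts` is hypothetical, and the pattern is an assumption taken from the log excerpt above, so other kernel versions may word the message differently.

```shell
# Count amdgpu ring-timeout lines in kernel log text read from stdin.
# grep -c prints 0 and exits non-zero when nothing matches; "|| true"
# keeps the function's exit status clean in that case.
count_ring_timeouts() {
  grep -c 'amdgpu: ring .* timeout' || true
}

# Demo with the line from the excerpt above:
echo 'amdgpu: ring comp_1.1.0 timeout, signaled seq=330115, emitted seq=330117' | count_ring_timeouts   # prints 1
```

On the real box: `sudo dmesg | count_ring_timeouts`. Any non-zero count during a long prefill points at the micro-batch size.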
Measured on this box, one variable at a time:
| Config | 64k | ~80k | ~88k | ~110k |
|---|---|---|---|---|
| -ub 2048, no FA | OK | — | DeviceLost | — |
| -ub 2048 + -fa on | — | — | DeviceLost | — |
| -ub 1024 + -fa on | — | DeviceLost | — | — |
| -ub 512 + -fa on | — | OK | OK | OK |
- Flash attention alone does not fix it.
- -ub 1024 still crashes, and its throughput advantage decays with context anyway (~360 tok/s at 55k → ~155 tok/s by 80k).
- -ub 512 is the only tested config that completes a full ~110k-token prefill, and it held through ~1 hour of sustained back-to-back 110k prefills with zero device-lost events.
Retrieval quality at the full window, same config: a needle-in-a-haystack fact planted at 10% / 50% / 90% depth of a ~110k-token document was retrieved exactly at all three depths. Full data and the debugging story: /blog/gpt-oss-120b-128k-context-strix-halo.
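Returning to the micro-batch mechanism described at the start of this step: if each compute submission covers -ub tokens, a smaller -ub means more submissions that are each shorter. The sketch below only counts submissions, as a simplification. It does not model how long each one runs, and the function name `prefill_submissions` is hypothetical.

```shell
# Number of micro-batch submissions for a prefill: ceil(tokens / ub).
# Simplifying assumption: every submission covers exactly -ub tokens,
# except possibly the last.
prefill_submissions() {
  local tokens ub
  tokens=$1
  ub=$2
  echo $(( (tokens + ub - 1) / ub ))
}

prefill_submissions 110000 512    # prints 215
prefill_submissions 110000 2048   # prints 54
```

Under this simplification, a ~110k-token prompt is split into about four times as many submissions at -ub 512 as at -ub 2048, so each one covers a quarter of the tokens. That is consistent with each submission staying under the watchdog, though this recipe did not measure per-submission time directly.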
Step 8 — Host tuning (optional but used on the verified box)
echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/99-llm-tuning.conf
sudo sysctl -p /etc/sysctl.d/99-llm-tuning.conf
sudo tee /etc/systemd/system/llm-tuning.service <<'EOF'
[Unit]
Description=LLM tuning: transparent huge pages + AMD GPU power level
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo always > /sys/kernel/mm/transparent_hugepage/enabled"
ExecStart=/bin/sh -c "echo always > /sys/kernel/mm/transparent_hugepage/defrag"
ExecStart=/bin/sh -c "for f in /sys/class/drm/card*/device/power_dpm_force_performance_level; do [ -w \"$f\" ] && echo auto > \"$f\"; done"
RemainAfterExit=true

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now llm-tuning.service
Use auto, not high, for power_dpm_force_performance_level. The auto governor still ramps the GPU fully during inference; pinning high removes cooling time between bursts on a small-form-factor chassis.
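To confirm the tuning service took effect, the active transparent-huge-page mode can be read back. The kernel marks the active choice with square brackets (for example `always madvise [never]`), so the sketch below extracts the bracketed word. The function name `thp_active` is hypothetical.

```shell
# Print the active setting (the bracketed word) from the contents of a
# transparent_hugepage sysfs file supplied on stdin.
thp_active() {
  sed -n 's/.*\[\([a-z+]*\)\].*/\1/p'
}

# Demo with a typical untuned value:
echo 'always madvise [never]' | thp_active   # prints never
```

After Step 8, `thp_active < /sys/kernel/mm/transparent_hugepage/enabled` should print `always`.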
Step 9 — Smoke test the OpenAI-compatible API
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "gpt-oss-120b",
"messages": [{"role": "user", "content": "Reply with exactly two words."}],
"max_tokens": 32
}'
Expect an OpenAI-style JSON response in ~1–2 s warm. Any OpenAI-compatible client on the LAN can point at http://<box-ip>:8080/v1.
gpt-oss quirk: the model sometimes returns an empty content with the entire answer in reasoning_content (its harmony “analysis” channel). Clients must read both fields. Set reasoning effort via {"chat_template_kwargs": {"reasoning_effort": "low"}}.
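One way to handle the two-field quirk from the shell is to prefer content and fall back to reasoning_content. This is a sketch, assuming `jq` is installed (`sudo apt install -y jq`) and that the response follows the usual OpenAI chat-completions shape with the answer under `choices[0].message`. The function name `answer_text` is hypothetical.

```shell
# Print the assistant's answer from a chat-completions response on stdin,
# falling back to reasoning_content when content is empty or missing.
# Assumption: jq is available on PATH.
answer_text() {
  jq -r '.choices[0].message
         | if (.content // "") != "" then .content
           else (.reasoning_content // "") end'
}
```

Pipe the Step 9 smoke-test request into it, for example `curl -s ... | answer_text`, to get a single string regardless of which field the model filled.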
Performance targets
Measured on the verified box (96 GB EVO-X2, locked config above). If your numbers are >20% off, re-check Steps 1–3.
| Metric | Value |
|---|---|
| Generation, short context | ~48 tok/s |
| Generation, interactive multi-turn | ~48 tok/s |
| Prefill at ~88k tokens (single prompt) | ~164 tok/s |
| Prefill at ~110k tokens (single prompt) | ~195 tok/s |
| Prefill, sustained back-to-back ~110k prompts | ~100–113 tok/s (thermal regime) |
| Memory with model resident at 128K | ~65 GiB of 89 GiB |
| Cold model load (service start) | ~36 s |
| NIAH retrieval at 10/50/90% depth of ~110k tokens | 3/3 exact |
For reference, prefill at short-to-mid context is much faster (722 tok/s at 16k and 877 tok/s at 32k were measured with -ub 2048 before the stability fix; -ub 512 trades some of that for a window that does not crash).
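The ">20% off" rule of thumb from the top of this section can be applied mechanically when comparing your own measurements to the table. The sketch below treats "off" as a deviation in either direction and uses integer tok/s values. Both choices are assumptions, and the function name `within_20pct` is hypothetical.

```shell
# Succeed if a measured value is within 20% of the target value.
# Integer arithmetic: |measured - target| * 5 <= target.
within_20pct() {
  local measured target diff
  measured=$1
  target=$2
  diff=$(( measured > target ? measured - target : target - measured ))
  [ $(( diff * 5 )) -le "$target" ]
}

within_20pct 40 48 && echo "within range"      # 40 vs ~48 tok/s: prints "within range"
within_20pct 35 48 || echo "re-check Steps 1-3" # 35 vs ~48 tok/s: prints "re-check Steps 1-3"
```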
Troubleshooting
llama.cpp reports no GPU / generation is absurdly slow
Vulkan fell back to llvmpipe. vulkaninfo --summary must list the Radeon device. The cause is almost always missing render/video group membership (Step 3), including for the systemd service user (SupplementaryGroups=).
vk::DeviceLostError / core-dump during a long prompt, with amdgpu ring timeout in dmesg
Micro-batch too large for long-context prefill on this iGPU. Set -ub 512 (Step 7). Flash attention alone will not fix it. After the crash the kernel recovers the GPU via ring reset and systemd restarts the service (~45 s), but the in-flight request is lost.
dmesg shows less GTT than expected
Kernel cmdline didn't take effect (Step 2), or the BIOS UMA carve-out is still large (Step 1) — every GiB the BIOS reserves is a GiB Linux never sees.
Responses have empty content
Not a failure. Read reasoning_content (the harmony analysis channel) as the fallback; see Step 9.
Service restart-loops after repeated crashes
By design it stops: StartLimitBurst=3 in 120 s leaves the unit failed so the problem is visible rather than silently looping. After fixing the cause, run sudo systemctl reset-failed llama-server && sudo systemctl start llama-server.
What this recipe does NOT cover
- Public exposure — this is a LAN-only configuration. Public internet exposure needs reverse proxy, auth, and rate limiting; not in scope.
- ROCm — not used here at all. Vulkan/RADV on stock Mesa is the verified path on this hardware.
- Ollama — the previous version of this recipe used it; retired after a documented gfx1151 output-corruption failure mode (repeating characters after several turns) that does not reproduce on llama.cpp + Vulkan.
- Agent/assistant frontends — this recipe ends at a clean OpenAI-compatible endpoint. What you point at it is up to you.
Why this recipe exists
This is the local-LLM testbed configuration we run for AGLedger, a cryptographic notary for automated work. We needed a reproducible, on-premises, frontier-quality LLM environment to test agent workloads against — including whether a local 120B model can drive real tool-calling loops, and what that means for proving what an agent actually did. If you are running local agents and want a tamper-evident record of every turn they take, AGLedger Developer Edition is free and fully unlocked, runs offline, no phone home: /install.