Recipe · verified 2026-06-02

Engineering recipe

gpt-oss-120b on Strix Halo + Ubuntu 26.04 (llama.cpp + Vulkan)

Agent-readable version

Plain markdown at /recipes/local-llm-strix-halo-ubuntu-26-04.md. Fetch with one curl, ingest with any LLM tool, license CC0.

curl -O https://agledger.ai/recipes/local-llm-strix-halo-ubuntu-26-04.md

Revision note — 2026-06-02 rebuild

An earlier version of this recipe (verified 2026-05-06) used Ollama with its bundled ROCm runtime plus an agent gateway. That stack worked at short context but produced a documented output-corruption failure mode on gfx1151 (repeating characters after several conversation turns) and has been retired. The current recipe builds llama.cpp directly with the Vulkan backend on stock Mesa/RADV: faster (~48 vs ~35 tok/s generation), more stable, and one fewer kernel parameter — amdgpu.no_system_mem_limit=1 was a ROCm-path requirement and is not needed on Vulkan.

What this recipe produces

A 96 GB Strix Halo box (or any AMD Ryzen AI MAX+ 395 platform with comparable BIOS access)
gpt-oss-120b (MXFP4 GGUF) at full GPU offload via llama.cpp + Vulkan/RADV
The full native 131,072-token context window, stable through ~110k-token prefills with zero GPU resets (requires -ub 512; see Step 7)
~48 tok/s generation in interactive multi-turn use
An OpenAI-compatible HTTP API on the LAN at port 8080, run as a systemd service that survives reboots
~65 GiB of 89 GiB system memory in use with the model resident at full context

Prerequisites

Hardware

AMD Ryzen AI MAX+ 395 (“Strix Halo”) APU with Radeon 8060S iGPU (gfx1151)
96 GB or 128 GB unified LPDDR5X memory (96 GB is sufficient for gpt-oss-120b at full 128K context)
~80 GB free on NVMe for the model weights (3-part GGUF, ~63 GB, which is ~59 GiB)
LAN connectivity

Software

Ubuntu 26.04 LTS, fresh install with kernel 7.0+
sudo access
No ROCm install, no PPA, no mesa-git: stock Mesa 26.0.3 already identifies the device as RADV STRIX_HALO

Software versions verified

Ubuntu 26.04 LTS (Resolute Raccoon)

kernel 7.0.0-22-generic

Mesa / RADV 26.0.3-1ubuntu1

llama.cpp build 4fb16ec (GGML_VULKAN=ON, Release)

gpt-oss-120b MXFP4 GGUF (ggml-org), 3 parts, ~63 GB

Step 1 — BIOS: reduce UMA Frame Buffer to minimum

At boot, enter the BIOS setup utility (typically Delete or F2 on the EVO-X2). Find the UMA Frame Buffer Size setting (under Memory or Advanced/AMD CBS depending on platform).

Set it to the BIOS minimum. On the GMKtec EVO-X2, that minimum is 2 GB. Save and reboot.

Verify after boot:

sudo dmesg | grep "of VRAM memory ready"
# Expect: amdgpu 0000:c5:00.0:  2048M of VRAM memory ready

On a unified-memory APU, BIOS-reserved VRAM is invisible to Linux as system RAM. The iGPU reaches the model through the kernel's GTT pool (ordinary system RAM pinned for GPU use), so a big fixed carve-out only shrinks the pool everything actually runs in. With 2 GB reserved, the 96 GB box exposes 89 GiB to Linux.

Step 2 — Kernel command line: raise the GTT cap

sudo cp /etc/default/grub /etc/default/grub.bak.$(date +%s)
sudo sed -i 's|^GRUB_CMDLINE_LINUX_DEFAULT="[^"]*"|GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=off ttm.pages_limit=20971520 ttm.page_pool_size=20971520"|' /etc/default/grub
sudo update-grub
sudo systemctl reboot

20971520 pages × 4 KiB = 80 GiB GPU-allocatable GTT. ttm.page_pool_size matches so the TTM page pool can back the full cap. amd_iommu=off removes IOMMU translation overhead on the unified-memory path.
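The page count follows directly from the target GTT size. A minimal sketch of that arithmetic, assuming the usual 4 KiB page size on x86_64; gib_to_pages is a hypothetical helper name, not part of any tool:

```shell
# Convert a GTT size in GiB to a ttm.pages_limit value.
# Assumes 4 KiB pages (check with: getconf PAGESIZE).
gib_to_pages() {
  gib=$1
  # 1 GiB = 1024 * 1024 KiB, and each page is 4 KiB
  echo $(( gib * 1024 * 1024 / 4 ))
}

gib_to_pages 80   # prints 20971520
```

The same helper gives a starting value for a different cap, for example on a 128 GB box, as long as the result stays below what Linux actually sees as system RAM.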

Verify after reboot:

cat /proc/cmdline | tr ' ' '\n' | grep -E 'ttm|iommu'
# Expect:
#   amd_iommu=off
#   ttm.pages_limit=20971520
#   ttm.page_pool_size=20971520

cat /sys/module/ttm/parameters/pages_limit
# Expect: 20971520

sudo dmesg | grep "of GTT memory ready"
# Expect: amdgpu 0000:c5:00.0:  81920M of GTT memory ready.

The kernel does not auto-raise the GTT allocation cap to match installed RAM — without ttm.pages_limit, large single allocations fail and the model cannot fully offload.
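The two dmesg checks (VRAM in Step 1, GTT here) can be scripted instead of read by eye. The sketch below assumes the log lines keep the "<N>M of <pool> memory ready" wording shown in this recipe; amdgpu_pool_mib is a hypothetical helper name:

```shell
# Read kernel log text on stdin and print the size in MiB that amdgpu
# reported for the given pool ("VRAM" or "GTT"). Prints nothing if the
# line is absent.
amdgpu_pool_mib() {
  pool=$1
  sed -n "s/.*[^0-9]\([0-9][0-9]*\)M of ${pool} memory ready.*/\1/p" | head -n 1
}

# Usage on a live system:
#   sudo dmesg | amdgpu_pool_mib GTT    # 81920 on the verified box
#   sudo dmesg | amdgpu_pool_mib VRAM   # 2048 on the verified box
```

A GTT value of 81920 MiB matches the cap set on the kernel command line: 20971520 pages × 4 KiB = 81920 MiB.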

What you do NOT need on this stack: amdgpu.no_system_mem_limit=1. That parameter works around a cap in the ROCm SVM allocation path; llama.cpp's Vulkan backend allocates through RADV/GTT and never hits it. (amdgpu.gttsize= is deprecated and ignored on kernel 7.0, although some older guides still recommend it.)

Step 3 — Add your user to the render and video groups

sudo usermod -aG render,video $USER
# Log out and back in (or reboot) for membership to apply

Verify:

groups | tr ' ' '\n' | grep -E 'render|video'
# Expect both: render
#              video

sudo apt install -y vulkan-tools
vulkaninfo --summary | grep deviceName
# Expect: deviceName = Radeon 8060S Graphics (RADV STRIX_HALO)

Without render/video membership, the user cannot open /dev/dri/renderD*. Vulkan then silently falls back to llvmpipe (CPU software rasterizer), and llama.cpp sees no usable GPU. This is the single most common way this build “works but is 50× too slow.”
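Because the llvmpipe fallback is silent, it can help to turn the check into a pass/fail test. This is a sketch only: it assumes the vulkaninfo summary keeps the "deviceName = ... (RADV ...)" form shown above, and has_radv_device is a hypothetical helper name:

```shell
# Read `vulkaninfo --summary` output on stdin.
# Succeeds only if some deviceName line is served by RADV, Mesa's Vulkan
# driver for AMD GPUs. A summary whose only device is llvmpipe fails.
has_radv_device() {
  grep -Eq 'deviceName.*RADV'
}

# Usage:
#   vulkaninfo --summary | has_radv_device \
#     || echo "no RADV device visible; check render/video membership"
```

Run it as the same user that will run llama-server, since group membership is per user.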

Step 4 — Build llama.cpp with the Vulkan backend

sudo apt install -y build-essential cmake git libvulkan-dev glslc glslang-tools spirv-headers

cd ~
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

Verify:

./build/bin/llama-server --version
# Expect a version/build line, built for Linux x86_64

./build/bin/llama-server --list-devices 2>&1 | grep -i vulkan
# Expect a Vulkan0 device naming the Radeon 8060S (RADV STRIX_HALO)

Distro/prebuilt llama.cpp binaries are usually CPU-only or CUDA. The Vulkan backend must be compiled in (GGML_VULKAN=ON), and it builds against stock Ubuntu 26.04 Vulkan SDK packages — no ROCm, no AMD driver download.

Step 5 — Download the model (3-part MXFP4 GGUF, ~63 GB)

sudo mkdir -p /data/models/gpt-oss-120b && sudo chown $USER: /data/models/gpt-oss-120b
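# If pip refuses with "externally-managed-environment" (PEP 668), run this inside a venv or use pipx
pip install -U "huggingface_hub[cli]"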
hf download ggml-org/gpt-oss-120b-GGUF \
  --include "*mxfp4*" --local-dir /data/models/gpt-oss-120b

Verify:

ls -l /data/models/gpt-oss-120b/
# Expect three files, ~63 GB total:
#   gpt-oss-120b-mxfp4-00001-of-00003.gguf   (~13 MB index part)
#   gpt-oss-120b-mxfp4-00002-of-00003.gguf   (~32 GB)
#   gpt-oss-120b-mxfp4-00003-of-00003.gguf   (~32 GB)

Put models on your largest/fastest NVMe (this recipe uses /data, a dedicated ext4 partition). llama.cpp is pointed at part 1 and finds the rest automatically.
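Since llama.cpp is only pointed at part 1, an interrupted download of a later part shows up only at load time. A small sketch that checks every part up front, assuming the NNNNN-of-MMMMM file naming from the listing above; check_gguf_parts is a hypothetical helper name:

```shell
# Check that every part of a split GGUF is present and non-empty.
# Split files are named <prefix>-NNNNN-of-MMMMM.gguf.
check_gguf_parts() {
  dir=$1; prefix=$2; total=$3
  i=1
  while [ "$i" -le "$total" ]; do
    part=$(printf '%s/%s-%05d-of-%05d.gguf' "$dir" "$prefix" "$i" "$total")
    if [ ! -s "$part" ]; then
      echo "missing or empty: $part" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "all $total parts present"
}

# Usage:
#   check_gguf_parts /data/models/gpt-oss-120b gpt-oss-120b-mxfp4 3
```

This only confirms the files exist and are non-empty; it does not verify their contents.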

Step 6 — Run llama-server as a systemd service

sudo tee /etc/systemd/system/llama-server.service <<'EOF'
[Unit]
Description=llama.cpp server (gpt-oss-120b, Vulkan)
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=120
StartLimitBurst=3

[Service]
Type=simple
User=youruser
Group=youruser
SupplementaryGroups=render video
ExecStart=/home/youruser/llama.cpp/build/bin/llama-server -m /data/models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 999 -c 131072 --jinja -fa on -ub 512 -b 2048 --host 0.0.0.0 --port 8080
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server.service

(Replace youruser with your user.) Flag by flag:

-ngl 999: offload all layers to the GPU.
-c 131072: the model's full native context window. It fits: gpt-oss's GQA + sliding-window attention keep the KV cache small (~3 GiB more than at 32K).
--jinja: use the model's own chat template (required for correct gpt-oss tool-calling and the harmony format).
-fa on: flash attention. Sensible, but it does NOT by itself prevent the long-context GPU crash (Step 7).
-ub 512: load-bearing for long-context stability. Do not raise it. See Step 7.
SupplementaryGroups=render video: the service user needs GPU device access even under systemd.
StartLimitBurst=3 over 120 s: repeated crashes stay failed and visible instead of silently restart-looping.

Verify (model load takes ~36 s):

sleep 40 && curl -sf http://127.0.0.1:8080/health
# Expect: {"status":"ok"}

journalctl -u llama-server -b --no-pager | grep -iE "vulkan|n_ctx" | head
# Expect a Vulkan device line naming RADV STRIX_HALO and n_ctx = 131072
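The fixed sleep 40 above assumes a load time close to the ~36 s measured on this box. On slower storage, polling the health endpoint may be more robust. A minimal sketch with a generic retry helper; wait_until is a hypothetical name and the one-second interval is an arbitrary choice:

```shell
# Retry a command once per second until it succeeds or the timeout expires.
# Returns 0 on the first success, 1 if the command never succeeded.
wait_until() {
  timeout_s=$1
  shift
  elapsed=0
  while [ "$elapsed" -lt "$timeout_s" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Usage (same /health endpoint as above):
#   wait_until 120 curl -sf http://127.0.0.1:8080/health && echo "server ready"
```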

Step 7 — The flag that keeps 128K from crashing the GPU: -ub 512

This is the part most guides do not cover, because it only bites past ~80,000 tokens of prefill. Each prefill compute submission covers -ub (micro-batch) tokens. At high context, attention over the large KV cache makes a single submission expensive enough to exceed the amdgpu compute-ring watchdog. The kernel resets the ring, the Vulkan device is lost, and llama-server dies mid-request:

amdgpu: ring comp_1.1.0 timeout, signaled seq=330115, emitted seq=330117
amdgpu: Starting comp_1.1.0 ring reset ... device wedged, but recovered through reset
llama-server: terminate called after throwing 'vk::DeviceLostError'
systemd: llama-server.service: Failed with result 'core-dump'
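To tell whether a box has hit this failure since boot, the ring-timeout line is the one to look for. The sketch below assumes the kernel keeps the "ring <name> timeout" wording from the excerpt above; count_ring_timeouts is a hypothetical helper name:

```shell
# Count amdgpu ring-timeout lines in kernel log text read from stdin.
count_ring_timeouts() {
  # grep -c prints 0 but exits non-zero when nothing matches; mask that
  # so the function is safe under `set -e`.
  grep -cE 'amdgpu.* ring [^ ]+ timeout' || true
}

# Usage:
#   sudo dmesg | count_ring_timeouts
#   journalctl -k -b --no-pager | count_ring_timeouts
```

A non-zero count after a long-prompt session points at the micro-batch setting rather than at the model or the client.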

Measured on this box, one variable at a time. Columns are prefill length in tokens; a dash marks a combination with no measurement:

Config	64k	~80k	~88k	~110k
-ub 2048, no FA	OK	—	DeviceLost	—
-ub 2048 + -fa on	—	—	DeviceLost	—
-ub 1024 + -fa on	—	DeviceLost	—	—
-ub 512 + -fa on	—	OK	OK	OK

Flash attention alone does not fix it.
-ub 1024 still crashes, and its throughput advantage decays with context anyway (~360 tok/s at 55k → ~155 tok/s by 80k).
-ub 512 is the only tested config that completes a full ~110k-token prefill, and it held through ~1 hour of sustained back-to-back 110k prefills with zero device-lost events.
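Since each prefill submission covers -ub tokens, a smaller micro-batch splits the same prompt into more, individually shorter submissions. The rough arithmetic, as a sketch that ignores how llama.cpp schedules work internally (ubatch_count is a hypothetical helper name):

```shell
# Number of prefill submissions for a prompt: ceil(tokens / ubatch).
ubatch_count() {
  tokens=$1; ub=$2
  echo $(( (tokens + ub - 1) / ub ))
}

ubatch_count 110000 512    # prints 215
ubatch_count 110000 2048   # prints 54
```

With -ub 512 a ~110k-token prompt becomes about four times as many submissions as with -ub 2048, each doing roughly a quarter of the work, which is what keeps any single one under the watchdog limit.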

Retrieval quality at the full window, same config: a needle-in-a-haystack fact planted at 10% / 50% / 90% depth of a ~110k-token document was retrieved exactly at all three depths. Full data and the debugging story: /blog/gpt-oss-120b-128k-context-strix-halo.

Step 8 — Host tuning (optional but used on the verified box)

echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/99-llm-tuning.conf
sudo sysctl -p /etc/sysctl.d/99-llm-tuning.conf

sudo tee /etc/systemd/system/llm-tuning.service <<'EOF'
[Unit]
Description=LLM tuning: transparent huge pages + AMD GPU power level
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo always > /sys/kernel/mm/transparent_hugepage/enabled"
ExecStart=/bin/sh -c "echo always > /sys/kernel/mm/transparent_hugepage/defrag"
ExecStart=/bin/sh -c "for f in /sys/class/drm/card*/device/power_dpm_force_performance_level; do [ -w \"$f\" ] && echo auto > \"$f\"; done"
RemainAfterExit=true

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llm-tuning.service

Use auto, not high, for power_dpm_force_performance_level. The auto governor still ramps the GPU fully during inference; pinning high removes cooling time between bursts on a small-form-factor chassis.
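To confirm the tuning service took effect, read the settings back. The transparent huge page files list every mode and mark the active one with square brackets; the sketch below extracts it (thp_active_mode is a hypothetical helper name):

```shell
# The kernel marks the active THP mode with square brackets,
# e.g. "[always] madvise never". Print just the active mode from stdin.
thp_active_mode() {
  sed -n 's/.*\[\([a-z+_]*\)\].*/\1/p'
}

# Usage:
#   thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled   # expect: always
#   thp_active_mode < /sys/kernel/mm/transparent_hugepage/defrag    # expect: always
#   cat /sys/class/drm/card*/device/power_dpm_force_performance_level   # expect: auto
```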

Step 9 — Smoke test the OpenAI-compatible API

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Reply with exactly two words."}],
    "max_tokens": 32
  }'

Expect an OpenAI-style JSON response in ~1–2 s warm. Any OpenAI-compatible client on the LAN can point at http://<box-ip>:8080/v1.

gpt-oss quirk: the model sometimes returns an empty content with the entire answer in reasoning_content (its harmony “analysis” channel). Clients must read both fields. Set reasoning effort via {"chat_template_kwargs": {"reasoning_effort": "low"}}.
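On the client side, reading both fields can look like the sketch below. It assumes jq is installed (sudo apt install -y jq) and that the response follows the usual OpenAI shape with the text under choices[0].message; extract_answer is a hypothetical helper name:

```shell
# Print the answer text from a chat completion response on stdin.
# Prefers message.content; falls back to message.reasoning_content when
# content is missing or empty. Requires jq.
extract_answer() {
  jq -r '.choices[0].message
         | if ((.content // "") | length) > 0
           then .content
           else (.reasoning_content // "")
           end'
}

# Usage, with reasoning effort set to low:
#   curl -s http://127.0.0.1:8080/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model": "gpt-oss-120b",
#          "messages": [{"role": "user", "content": "Reply with exactly two words."}],
#          "chat_template_kwargs": {"reasoning_effort": "low"},
#          "max_tokens": 32}' | extract_answer
```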

Performance targets

Measured on the verified box (96 GB EVO-X2, locked config above). If your numbers are >20% off, re-check Steps 1–3.

Metric	Value
Generation, short context	~48 tok/s
Generation, interactive multi-turn	~48 tok/s
Prefill at ~88k tokens (single prompt)	~164 tok/s
Prefill at ~110k tokens (single prompt)	~195 tok/s
Prefill, sustained back-to-back ~110k prompts	~100–113 tok/s (thermal regime)
Memory with model resident at 128K	~65 GiB of 89 GiB
Cold model load (service start)	~36 s
NIAH retrieval at 10/50/90% depth of ~110k tokens	3/3 exact

For reference, prefill at short-to-mid context is much faster (722 tok/s at 16k and 877 tok/s at 32k were measured with -ub 2048 before the stability fix; -ub 512 trades some of that for a window that does not crash).
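The 20% rule of thumb above is easy to apply mechanically. A sketch that treats only a shortfall as a problem (running faster than the target is not flagged, which is a choice made here, not something the measurements dictate); within_pct is a hypothetical helper name:

```shell
# Succeed if a measured value is no more than pct percent below the target.
within_pct() {
  measured=$1; target=$2; pct=$3
  awk -v m="$measured" -v t="$target" -v p="$pct" \
    'BEGIN { exit !(m >= t * (1 - p / 100)) }'
}

# Usage, against the ~48 tok/s generation target:
#   within_pct 44.5 48 20 && echo "generation speed in range"
```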

Troubleshooting

llama.cpp reports no GPU / generation is absurdly slow

Vulkan fell back to llvmpipe. vulkaninfo --summary must list the Radeon device. The cause is almost always missing render/video group membership (Step 3), including for the systemd service user (SupplementaryGroups=).

vk::DeviceLostError / core-dump during a long prompt, with amdgpu ring timeout in dmesg

Micro-batch too large for long-context prefill on this iGPU. Set -ub 512 (Step 7). Flash attention alone will not fix it. After the crash the kernel recovers the GPU via ring reset and systemd restarts the service (~45 s), but the in-flight request is lost.

dmesg shows less GTT than expected

Kernel cmdline didn't take effect (Step 2), or the BIOS UMA carve-out is still large (Step 1) — every GiB the BIOS reserves is a GiB Linux never sees.

Responses have empty content

Not a failure. Read reasoning_content (harmony analysis channel) as the fallback; see Step 9.

Service restart-loops after repeated crashes

By design it stops: StartLimitBurst=3 in 120 s leaves the unit failed so the problem is visible rather than silently looping. After fixing the cause, run sudo systemctl reset-failed llama-server && sudo systemctl start llama-server.

What this recipe does NOT cover

Public exposure — this is a LAN-only configuration. Public internet exposure needs reverse proxy, auth, and rate limiting; not in scope.
ROCm — not used here at all. Vulkan/RADV on stock Mesa is the verified path on this hardware.
Ollama — the previous version of this recipe used it; retired after a documented gfx1151 output-corruption failure mode (repeating characters after several turns) that does not reproduce on llama.cpp + Vulkan.
Agent/assistant frontends — this recipe ends at a clean OpenAI-compatible endpoint. What you point at it is up to you.

Why this recipe exists

This is the local-LLM testbed configuration we run for AGLedger, a cryptographic notary for automated work. We needed a reproducible, on-premises, frontier-quality LLM environment to test agent workloads against — including whether a local 120B model can drive real tool-calling loops, and what that means for proving what an agent actually did. If you are running local agents and want a tamper-evident record of every turn they take, AGLedger Developer Edition is free and fully unlocked, runs offline, no phone home: /install.

Related

Story: The build story: gpt-oss-120b on Strix Halo + Ubuntu 26.04

Deep dive: Surviving 128K context: the -ub 512 story

Install: AGLedger Developer Edition install
