Recipe · verified 2026-05-06
Engineering recipe
gpt-oss:120b on Strix Halo + Ubuntu 26.04
Agent-readable version
Plain markdown at /recipes/local-llm-strix-halo-ubuntu-26-04.md. Fetch with one curl, ingest with any LLM tool, license CC0.
curl -O https://agledger.ai/recipes/local-llm-strix-halo-ubuntu-26-04.md
What this recipe produces
- A 96 GB Strix Halo box (or any AMD Ryzen AI MAX+ 395 platform with comparable BIOS access) configured as a local LLM server
- gpt-oss:120b sustained at ~35 tok/s, 100% GPU offload, 128K context window
- Ollama HTTP API reachable on the LAN at port 11434
- Optional OpenClaw agent gateway bridging Telegram, Slack, Discord, and other channels to the local model
- Idle wall power ~40 W; sustained inference 100-140 W; cold load ~24 s; warm one-word response ~1.2 s direct or under 4 s through OpenClaw
Prerequisites
Hardware
- AMD Ryzen AI MAX+ 395 (“Strix Halo”) APU with Radeon 8060S iGPU (gfx1151)
- 96 GB or 128 GB unified LPDDR5X memory (96 GB is sufficient for gpt-oss:120b at 128K context)
- ~150 GB free on NVMe (model weights are 65 GB; allow space for additional models)
- LAN connectivity
Software
- Ubuntu 26.04 LTS, fresh install with kernel 7.0+
- sudo access
- About 30 minutes wall-clock for first cold pull and configuration
Software versions verified
- Ubuntu 26.04 LTS (Resolute Raccoon)
- kernel 7.0.0-15-generic
- Mesa / RADV 26.0.3-1ubuntu1
- amdgpu DRM driver 3.64.0
- Ollama 0.22.1
- OpenClaw 2026.5.4
- Node.js 24.15.0
Step 1 — BIOS: reduce UMA Frame Buffer to minimum
At boot, enter the BIOS setup utility (typically Delete or F2 on the EVO-X2). Find the UMA Frame Buffer Size setting (under Memory or Advanced/AMD CBS depending on platform).
Set to the BIOS minimum. On the GMKtec EVO-X2, that minimum is 2 GB. Save and reboot.
Verify after boot:
sudo dmesg | grep "of VRAM memory ready"
# Expect: amdgpu 0000:c4:00.0: 2048M of VRAM memory ready
On a unified-memory APU, BIOS-reserved VRAM is invisible to Linux as system RAM. The 32 GB default reserves 32 GB you cannot use for the model. Reducing to 2 GB frees ~28 GB for system + GPU use through the GTT pool.
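As a cross-check, the amdgpu driver exposes both the carve-out and the GTT pool through sysfs. A minimal sketch, assuming the standard amdgpu mem_info nodes (the card index may differ on your system):
# Optional: compare the BIOS VRAM carve-out with the GTT pool, in MiB
for f in /sys/class/drm/card*/device/mem_info_vram_total \
         /sys/class/drm/card*/device/mem_info_gtt_total; do
  printf '%s: %d MiB\n' "$f" "$(( $(cat "$f") / 1024 / 1024 ))"
done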
Step 2 — Kernel command line: set the parameter trio
sudo cp /etc/default/grub /etc/default/grub.bak.$(date +%s)
sudo sed -i 's|^GRUB_CMDLINE_LINUX_DEFAULT="[^"]*"|GRUB_CMDLINE_LINUX_DEFAULT="ttm.pages_limit=23068672 amdgpu.no_system_mem_limit=1"|' /etc/default/grub
sudo update-grub
sudo systemctl reboot
23068672 pages × 4 KB = 88 GiB GPU-allocatable cap. amdgpu.no_system_mem_limit=1 removes a separate AMDGPU-level cap on system memory pinning.
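The cap is plain arithmetic, so you can derive a different value for other memory budgets. A quick sanity check, assuming the kernel's 4 KiB page size:
# pages × 4 KiB → GiB
echo "$(( 23068672 * 4 / 1024 / 1024 )) GiB"    # prints: 88 GiB
# and back: GiB target → pages for ttm.pages_limit
echo "$(( 88 * 1024 * 1024 / 4 )) pages"        # prints: 23068672 pages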
Verify after reboot:
cat /proc/cmdline | tr ' ' '\n' | grep -E 'ttm|amdgpu'
# Expect:
# ttm.pages_limit=23068672
# amdgpu.no_system_mem_limit=1
cat /sys/module/ttm/parameters/pages_limit
# Expect: 23068672
dmesg | grep "of GTT memory ready"
# Expect: amdgpu 0000:c4:00.0: 90112M of GTT memory ready.
The kernel auto-detects how much GTT exists, but does NOT auto-bump the allocation cap past the BIOS VRAM slice. Older guides recommend amdgpu.gttsize= instead; that parameter is deprecated and ignored on kernel 7.0.
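If you are migrating a box that followed one of those older guides, check that the stale parameter is not still lingering next to the new ones:
grep -n "gttsize" /etc/default/grub /proc/cmdline
# Expect: no output on a clean 26.04 setup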
Step 3 — Install Ollama
curl -fsSL https://ollama.com/install.sh | sudo sh
The installer detects AMD via lspci, pulls the ROCm-tagged tarball with bundled gfx1151 kernels, registers the user with the ollama group, and starts a systemd service.
Verify:
ollama --version
# Expect: 0.22.1 or newer
systemctl is-active ollama
# Expect: active
sudo journalctl -u ollama --since '1 minute ago' | grep "library=ROCm"
# Expect: ...library=ROCm compute=gfx1151 ... type=iGPU
Ollama 0.22.1+ ships its own ROCm runtime. You do NOT need a separate ROCm install, do NOT need HSA_OVERRIDE_GFX_VERSION, and do NOT need to force Vulkan.
Step 4 — Configure Ollama for LAN access and large context
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_CONTEXT_LENGTH=131072"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
Always pin OLLAMA_CONTEXT_LENGTH explicitly. The auto-context picker over-shoots on this chip (62 GiB available → defaults to 262144 even for models with n_ctx_train=131072). Flash Attention + Q8 KV cache cuts per-token KV cost to ~22 KB, making 128K context comfortable.
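Two quick checks that the drop-in actually reached the running service: systemd can print the environment it injected, and any LAN machine can hit the version endpoint (the IP below is a placeholder for this box's address):
# All five variables from override.conf should appear
systemctl show ollama --property=Environment | tr ' ' '\n' | grep OLLAMA_
# From another machine on the LAN (replace 192.168.1.50 with this box's IP)
curl -s http://192.168.1.50:11434/api/version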
Step 5 — Pull gpt-oss:120b
ollama pull gpt-oss:120b # ~65 GB MXFP4 download
Do NOT pull while another large model is keep-alive resident. Concurrent VRAM pressure plus heavy disk I/O has produced ROCm SIGSEGVs on this chip. Symptom: llama runner process has terminated: %!w(<nil>).
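If something is resident, unload it before pulling; ollama stop drops a loaded model without restarting the service:
ollama ps
ollama stop <resident-model>    # placeholder: whatever ollama ps listed
ollama pull gpt-oss:120b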
Step 6 — Persistent sysctl + sysfs tuning
echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/99-llm-tuning.conf sudo sysctl -p /etc/sysctl.d/99-llm-tuning.conf sudo tee /etc/systemd/system/llm-tuning.service <<'EOF' [Unit] Description=LLM tuning: transparent huge pages + AMD GPU power level After=multi-user.target [Service] Type=oneshot ExecStart=/bin/sh -c "echo always > /sys/kernel/mm/transparent_hugepage/enabled" ExecStart=/bin/sh -c "echo always > /sys/kernel/mm/transparent_hugepage/defrag" ExecStart=/bin/sh -c "for f in /sys/class/drm/card*/device/power_dpm_force_performance_level; do [ -w \"$f\" ] && echo auto > \"$f\"; done" RemainAfterExit=true [Install] WantedBy=multi-user.target EOF sudo systemctl daemon-reload sudo systemctl enable --now llm-tuning.service
Use auto, not high. Pinning to high removes cooling time between inference bursts and has produced silicon-level thermal cutoffs (hard power-off, no kernel log) on small-form-factor chassis under sustained agent traffic. The kernel's auto governor still ramps the GPU to max during inference; it just relaxes between turns. Same throughput, lower thermal risk.
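To confirm what the service actually applied, read back the same sysfs node the unit writes:
cat /sys/class/drm/card*/device/power_dpm_force_performance_level
# Expect: auto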
Step 7 — Smoke test: load and run gpt-oss:120b
ollama run gpt-oss:120b "Reply with exactly two words." --verbose
ollama ps
Expect from ollama ps:
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:120b    a951a23b46a1    68 GB    100% GPU     131072     29 minutes from now
Critical: PROCESSOR must show 100% GPU.
Any non-100% split (e.g. 11%/89% CPU/GPU) is a configuration failure, not a tradeoff. Partial CPU offload has been measured at 0.27 tok/s vs 35.5 tok/s when fully on GPU — ~130× difference. If you see a split, recheck Steps 1-2.
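For a scripted pre-flight (cron job, agent launcher), a minimal check that keys off the same ollama ps output:
# Succeeds only when the model is fully offloaded
ollama ps | grep -q "100% GPU" && echo "OK: fully on GPU" || echo "WARN: CPU/GPU split, recheck Steps 1-2"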
Step 8 — (Optional) Install OpenClaw agent gateway
OpenClaw bridges messaging channels (Telegram, Slack, Discord, iMessage, WhatsApp, Signal, others) to the local Ollama backend.
# Node 24 LTS via NodeSource
curl -fsSL https://deb.nodesource.com/setup_24.x | sudo bash -
sudo apt install -y nodejs
# OpenClaw
sudo npm install -g openclaw@latest
openclaw onboard \
--non-interactive --accept-risk \
--auth-choice ollama \
--install-daemon \
--gateway-bind loopback --gateway-auth token
openclaw config set agents.defaults.model.primary "ollama/gpt-oss:120b"
# CRITICAL: raise the per-provider request timeout
cat <<'PATCH' | openclaw config patch --stdin
{ "models": { "providers": { "ollama": { "timeoutSeconds": 600 } } } }
PATCH
# Recommended on a memory-tight box: remove fallback prewarming
openclaw config unset agents.defaults.model.fallbacks
# User-scope service across logouts
sudo loginctl enable-linger $USER
systemctl --user enable --now openclaw-gateway
Why timeoutSeconds: 600 is critical: OpenClaw's default per-provider timeout (~140 seconds) is too short for gpt-oss:120b's reasoning phase. Reasoning models stream delta.reasoning tokens with no delta.content until they finish thinking, and the gateway waits for content. Symptom: FailoverError: LLM request timed out. Raising to 600 s eliminates the false failovers.
Why we remove fallback prewarming: OpenClaw's model-prewarm sidecar loads ALL configured fallbacks at gateway startup, not just on actual failover. On an 88 GB-cap iGPU with a 65 GB primary, prewarming a 12 GB fallback eats most of the headroom you tuned the rest of the system to keep.
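Once the unit is enabled, two standard systemd checks confirm the user-scope gateway (unit name as installed above) is up and logging:
systemctl --user is-active openclaw-gateway    # Expect: active
journalctl --user -u openclaw-gateway --since '5 minutes ago' | tail -n 20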
Step 9 — End-to-end verification
# Direct Ollama probe (model warm) — should return in ~1.2 s
curl -s http://127.0.0.1:11434/api/generate -d '{
"model": "gpt-oss:120b",
"prompt": "Reply with the single word: READY",
"stream": false
}' | grep -oE '"response":"[^"]*"'
# OpenClaw gateway probe (warm) — should return under 4 s
time openclaw infer model run --gateway --model ollama/gpt-oss:120b \
--prompt "Reply with the single word: READY"
# Confirm 100% GPU and 128K context
ollama ps
# Optional: install lm-sensors for ongoing thermal monitoring
sudo apt install -y lm-sensors
yes "" | sudo sensors-detect --auto
sensors
Benchmark targets
Run bench.py to compare. Expect results within ~10% of these numbers on a properly configured 96 GB box:
| Model | Resident | Code | Reasoning | Architecture |
|---|---|---|---|---|
| gemma4 | 10 GB | 54.3 tok/s | 52.0 tok/s | Dense |
| gpt-oss:20b | 13 GB | 48.7 tok/s | 47.4 tok/s | MoE, reasoning |
| gpt-oss:120b | 65 GB | 35.5 tok/s | 34.9 tok/s | MoE, ~5B active |
| llama3.3:70b | 57 GB | 5.1 tok/s | 5.1 tok/s | Dense |
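If you skip bench.py, ollama run --verbose prints throughput after each response; a rough spot-check against the table, using the model from Step 5:
# The "eval rate" line at the end is the generation tok/s to compare
ollama run gpt-oss:120b --verbose "Write a 200-word summary of the TCP three-way handshake."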
Troubleshooting
unable to allocate ROCm0 buffer
TTM cap is set, but amdgpu.no_system_mem_limit=1 is missing OR the BIOS UMA reduction was not applied. Check both. The trio (Steps 1+2) is non-optional for gpt-oss:120b-class loads.
ollama ps shows 11%/89% CPU/GPU
Not enough headroom. In order of likelihood: (1) BIOS UMA Frame Buffer is still at 32 GB default; (2) OLLAMA_CONTEXT_LENGTH too high; (3) another model is keep-alive resident.
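A quick way to walk those three in order, reusing the earlier verification commands:
dmesg | grep "of VRAM memory ready"    # (1) expect 2048M, not 32768M
systemctl show ollama --property=Environment | tr ' ' '\n' | grep OLLAMA_CONTEXT_LENGTH    # (2)
ollama ps                              # (3) anything else still resident?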
llama runner process has terminated
ROCm SIGSEGV under concurrent VRAM pressure. Avoid running ollama pull while a large model is keep-alive resident.
FailoverError: LLM request timed out
Default OpenClaw per-provider timeout too short. Raise models.providers.ollama.timeoutSeconds to 600 (Step 8).
Hard power-off during sustained inference, no kernel log
Silicon-level thermal cutoff. GPU pinned to high instead of auto. Switch to auto in llm-tuning.service (Step 6) and reboot.
sqlite-vec unavailable warning from OpenClaw
Known upstream ABI mismatch between better-sqlite3 (SQLite 3.51) and the precompiled sqlite-vec binary (3.45). OpenClaw falls back to in-process cosine similarity. No suppression flag exists. Ignore.
Out of scope for this recipe
- Public exposure — LAN-only configuration. Reverse proxy + auth + rate limiting are a separate procedure.
- Multi-user OpenClaw — single-owner Telegram-bot configuration.
- Vulkan as primary backend — if you want to compare ROCm vs Vulkan, build llama.cpp directly.
- Air-gapped install — possible but requires offline mirroring of Ollama, OpenClaw, and the model.
Why this recipe exists
This is the local-LLM testbed configuration we run for AGLedger, a cryptographic notary for automated work. We needed a reproducible, on-premises, near-frontier-quality LLM environment to test agent accountability flows against.
The full story — failure modes, before/after benchmarks, and the older-guides-vs-26.04 comparison — is in the companion blog post: Near Frontier-Quality LLM, No Cloud, No Subscription, Unlimited Tokens.
If you are running local agents and want a tamper-evident chain of every turn they take, AGLedger Developer Edition is free and fully unlocked; it runs offline and never phones home.
Sources & further reading
- /recipes/local-llm-strix-halo-ubuntu-26-04.md — agent-readable plain markdown of this recipe
- Companion blog post — the story, benchmarks, and full configuration walk-through
- bench.py — stdlib-only Python benchmark script
- Ollama OpenClaw integration docs