Recipe · verified 2026-05-06
Engineering recipe
gpt-oss:120b on Strix Halo + Ubuntu 26.04
Agent-readable version
Plain markdown at /recipes/local-llm-strix-halo-ubuntu-26-04.md. Fetch with one curl, ingest with any LLM tool, license CC0.
curl -O https://agledger.ai/recipes/local-llm-strix-halo-ubuntu-26-04.md
What this recipe produces
- A 96 GB Strix Halo box (or any AMD Ryzen AI MAX+ 395 platform with comparable BIOS access) configured as a local LLM server
- gpt-oss:120b sustained at ~35 tok/s, 100% GPU offload, 128K context window
- Ollama HTTP API reachable on the LAN at port 11434
- Optional OpenClaw agent gateway bridging Telegram, Slack, Discord, and other channels to the local model
- Idle wall power ~40 W; sustained inference 100-140 W; cold load ~24 s; warm one-word response ~1.2 s direct or under 4 s through OpenClaw
Prerequisites
Hardware
- AMD Ryzen AI MAX+ 395 (“Strix Halo”) APU with Radeon 8060S iGPU (gfx1151)
- 96 GB or 128 GB unified LPDDR5X memory (96 GB is sufficient for gpt-oss:120b at 128K context)
- ~150 GB free on NVMe (model weights are 65 GB; allow space for additional models)
- LAN connectivity
Software
- Ubuntu 26.04 LTS, fresh install with kernel 7.0+
- sudo access
- About 30 minutes wall-clock for first cold pull and configuration
Software versions verified
- Ubuntu 26.04 LTS (Resolute Raccoon)
- kernel 7.0.0-15-generic
- Mesa / RADV 26.0.3-1ubuntu1
- amdgpu DRM driver 3.64.0
- Ollama 0.22.1
- OpenClaw 2026.5.4
- Node.js 24.15.0
Step 1 — BIOS: reduce UMA Frame Buffer to minimum
At boot, enter the BIOS setup utility (typically Delete or F2 on the EVO-X2). Find the UMA Frame Buffer Size setting (under Memory or Advanced/AMD CBS depending on platform).
Set to the BIOS minimum. On the GMKtec EVO-X2, that minimum is 2 GB. Save and reboot.
Verify after boot:
sudo dmesg | grep "of VRAM memory ready"
# Expect: amdgpu 0000:c4:00.0: 2048M of VRAM memory ready
On a unified-memory APU, BIOS-reserved VRAM is invisible to Linux as system RAM. The 32 GB default reserves 32 GB you cannot use for the model. Reducing to 2 GB frees ~28 GB for system + GPU use through the GTT pool.
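As a cross-check, the amdgpu driver exposes both the carve-out and the GTT pool through sysfs. A minimal sketch, assuming the standard amdgpu mem_info nodes (the card index may differ on your system):
# Optional: compare the BIOS VRAM carve-out with the GTT pool, in MiB
for f in /sys/class/drm/card*/device/mem_info_vram_total \
         /sys/class/drm/card*/device/mem_info_gtt_total; do
  printf '%s: %d MiB\n' "$f" "$(( $(cat "$f") / 1024 / 1024 ))"
done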
Step 2 — Kernel command line: set the parameter trio
sudo cp /etc/default/grub /etc/default/grub.bak.$(date +%s)
sudo sed -i 's|^GRUB_CMDLINE_LINUX_DEFAULT="[^"]*"|GRUB_CMDLINE_LINUX_DEFAULT="ttm.pages_limit=23068672 amdgpu.no_system_mem_limit=1"|' /etc/default/grub
sudo update-grub
sudo systemctl reboot
23068672 pages × 4 KB = 88 GiB GPU-allocatable cap. amdgpu.no_system_mem_limit=1 removes a separate AMDGPU-level cap on system memory pinning.
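The cap is plain arithmetic, so you can derive a different value for other memory budgets. A quick sanity check, assuming the kernel's 4 KiB page size:
# pages × 4 KiB → GiB
echo "$(( 23068672 * 4 / 1024 / 1024 )) GiB"    # prints: 88 GiB
# and back: GiB target → pages for ttm.pages_limit
echo "$(( 88 * 1024 * 1024 / 4 )) pages"        # prints: 23068672 pages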
Verify after reboot:
cat /proc/cmdline | tr ' ' '\n' | grep -E 'ttm|amdgpu'
# Expect:
# ttm.pages_limit=23068672
# amdgpu.no_system_mem_limit=1
cat /sys/module/ttm/parameters/pages_limit
# Expect: 23068672
dmesg | grep "of GTT memory ready"
# Expect: amdgpu 0000:c4:00.0: 90112M of GTT memory ready.
The kernel auto-detects how much GTT exists, but does NOT auto-bump the allocation cap past the BIOS VRAM slice. Older guides recommend amdgpu.gttsize= instead; that parameter is deprecated and ignored on kernel 7.0.
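If you are migrating a box that followed one of those older guides, check that the stale parameter is not still lingering next to the new ones:
grep -n "gttsize" /etc/default/grub /proc/cmdline
# Expect: no output on a clean 26.04 setup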
Step 3 — Install Ollama
curl -fsSL https://ollama.com/install.sh | sudo sh
The installer detects AMD via lspci, pulls the ROCm-tagged tarball with bundled gfx1151 kernels, registers the user with the ollama group, and starts a systemd service.
Verify:
ollama --version
# Expect: 0.22.1 or newer
systemctl is-active ollama
# Expect: active
sudo journalctl -u ollama --since '1 minute ago' | grep "library=ROCm"
# Expect: ...library=ROCm compute=gfx1151 ... type=iGPU
Ollama 0.22.1+ ships its own ROCm runtime. You do NOT need a separate ROCm install, do NOT need HSA_OVERRIDE_GFX_VERSION, and do NOT need to force Vulkan.
Step 4 — Configure Ollama for LAN access and large context
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_CONTEXT_LENGTH=131072"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
Always pin OLLAMA_CONTEXT_LENGTH explicitly. The auto-context picker over-shoots on this chip (62 GiB available → defaults to 262144 even for models with n_ctx_train=131072). Flash Attention + Q8 KV cache cuts per-token KV cost to ~22 KB, making 128K context comfortable.
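Two quick checks that the drop-in actually reached the running service: systemd can print the environment it injected, and any LAN machine can hit the version endpoint (the IP below is a placeholder for this box's address):
# All five variables from override.conf should appear
systemctl show ollama --property=Environment | tr ' ' '\n' | grep OLLAMA_
# From another machine on the LAN (replace 192.168.1.50 with this box's IP)
curl -s http://192.168.1.50:11434/api/version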
Step 5 — Pull gpt-oss:120b
ollama pull gpt-oss:120b # ~65 GB MXFP4 download
Do NOT pull while another large model is keep-alive resident. Concurrent VRAM pressure plus heavy disk I/O has produced ROCm SIGSEGVs on this chip. Symptom: llama runner process has terminated: %!w(<nil>).
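If something is resident, unload it before pulling; ollama stop drops a loaded model without restarting the service:
ollama ps
ollama stop <resident-model>    # placeholder: whatever ollama ps listed
ollama pull gpt-oss:120b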
Step 6 — Persistent sysctl + sysfs tuning
echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/99-llm-tuning.conf sudo sysctl -p /etc/sysctl.d/99-llm-tuning.conf sudo tee /etc/systemd/system/llm-tuning.service <<'EOF' [Unit] Description=LLM tuning: transparent huge pages + AMD GPU power level After=multi-user.target [Service] Type=oneshot ExecStart=/bin/sh -c "echo always > /sys/kernel/mm/transparent_hugepage/enabled" ExecStart=/bin/sh -c "echo always > /sys/kernel/mm/transparent_hugepage/defrag" ExecStart=/bin/sh -c "for f in /sys/class/drm/card*/device/power_dpm_force_performance_level; do [ -w \"$f\" ] && echo auto > \"$f\"; done" RemainAfterExit=true [Install] WantedBy=multi-user.target EOF sudo systemctl daemon-reload sudo systemctl enable --now llm-tuning.service
Use auto, not high. Pinning to high removes cooling time between inference bursts and has produced silicon-level thermal cutoffs (hard power-off, no kernel log) on small-form-factor chassis under sustained agent traffic. The kernel's auto governor still ramps the GPU to max during inference; it just relaxes between turns. Same throughput, lower thermal risk.
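To confirm what the service actually applied, read back the same sysfs node the unit writes:
cat /sys/class/drm/card*/device/power_dpm_force_performance_level
# Expect: auto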
Step 7 — Smoke test: load and run gpt-oss:120b
ollama run gpt-oss:120b "Reply with exactly two words." --verbose
ollama ps
Expect from ollama ps:
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:120b    a951a23b46a1    68 GB    100% GPU     131072     29 minutes from now
Critical: PROCESSOR must show 100% GPU.
Any non-100% split (e.g. 11%/89% CPU/GPU) is a configuration failure, not a tradeoff. Partial CPU offload has been measured at 0.27 tok/s vs 35.5 tok/s when fully on GPU — ~130× difference. If you see a split, recheck Steps 1-2.
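For a scripted pre-flight (cron job, agent launcher), a minimal check that keys off the same ollama ps output:
# Succeeds only when the model is fully offloaded
ollama ps | grep -q "100% GPU" && echo "OK: fully on GPU" || echo "WARN: CPU/GPU split, recheck Steps 1-2"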
Step 8 — (Optional) Install OpenClaw agent gateway
OpenClaw bridges messaging channels (Telegram, Slack, Discord, iMessage, WhatsApp, Signal, others) to the local Ollama backend.
# Node 24 LTS via NodeSource
curl -fsSL https://deb.nodesource.com/setup_24.x | sudo bash -
sudo apt install -y nodejs
# OpenClaw
sudo npm install -g openclaw@latest
openclaw onboard \
--non-interactive --accept-risk \
--auth-choice ollama \
--install-daemon \
--gateway-bind loopback --gateway-auth token
openclaw config set agents.defaults.model.primary "ollama/gpt-oss:120b"
# CRITICAL: raise the per-provider request timeout
cat <<'PATCH' | openclaw config patch --stdin
{ "models": { "providers": { "ollama": { "timeoutSeconds": 600 } } } }
PATCH
# Recommended on a memory-tight box: remove fallback prewarming
openclaw config unset agents.defaults.model.fallbacks
# User-scope service across logouts
sudo loginctl enable-linger $USER
systemctl --user enable --now openclaw-gateway
Why timeoutSeconds: 600 is critical: OpenClaw's default per-provider timeout (~140 seconds) is too short for gpt-oss:120b's reasoning phase. Reasoning models stream delta.reasoning tokens with no delta.content until they finish thinking, and the gateway waits for content. Symptom: FailoverError: LLM request timed out. Raising to 600 s eliminates the false failovers.
Why we remove fallback prewarming: OpenClaw's model-prewarm sidecar loads ALL configured fallbacks at gateway startup, not just on actual failover. On an 88 GB-cap iGPU with a 65 GB primary, prewarming a 12 GB fallback eats most of the headroom you tuned the rest of the system to keep.
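Once the unit is enabled, two standard systemd checks confirm the user-scope gateway (unit name as installed above) is up and logging:
systemctl --user is-active openclaw-gateway    # Expect: active
journalctl --user -u openclaw-gateway --since '5 minutes ago' | tail -n 20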
Step 9 — End-to-end verification
# Direct Ollama probe (model warm) — should return in ~1.2 s
curl -s http://127.0.0.1:11434/api/generate -d '{
"model": "gpt-oss:120b",
"prompt": "Reply with the single word: READY",
"stream": false
}' | grep -oE '"response":"[^"]*"'
# OpenClaw gateway probe (warm) — should return under 4 s
time openclaw infer model run --gateway --model ollama/gpt-oss:120b \
--prompt "Reply with the single word: READY"
# Confirm 100% GPU and 128K context
ollama ps
# Optional: install lm-sensors for ongoing thermal monitoring
sudo apt install -y lm-sensors
yes "" | sudo sensors-detect --auto
sensors
Benchmark targets
Run bench.py to compare. Expect results within ~10% of these numbers on a properly configured 96 GB box:
| Model | Resident | Code | Reasoning | Architecture |
|---|---|---|---|---|
| gemma4 | 10 GB | 54.3 tok/s | 52.0 tok/s | Dense |
| gpt-oss:20b | 13 GB | 48.7 tok/s | 47.4 tok/s | MoE, reasoning |
| gpt-oss:120b | 65 GB | 35.5 tok/s | 34.9 tok/s | MoE, ~5B active |
| llama3.3:70b | 57 GB | 5.1 tok/s | 5.1 tok/s | Dense |
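If you skip bench.py, ollama run --verbose prints throughput after each response; a rough spot-check against the table, using the model from Step 5:
# The "eval rate" line at the end is the generation tok/s to compare
ollama run gpt-oss:120b --verbose "Write a 200-word summary of the TCP three-way handshake."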
Troubleshooting
unable to allocate ROCm0 buffer
TTM cap is set, but amdgpu.no_system_mem_limit=1 is missing OR the BIOS UMA reduction was not applied. Check both. The trio (Steps 1+2) is non-optional for gpt-oss:120b-class loads.
ollama ps shows 11%/89% CPU/GPU
Not enough headroom. In order of likelihood: (1) BIOS UMA Frame Buffer is still at 32 GB default; (2) OLLAMA_CONTEXT_LENGTH too high; (3) another model is keep-alive resident.
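A quick way to walk those three in order, reusing the earlier verification commands:
dmesg | grep "of VRAM memory ready"    # (1) expect 2048M, not 32768M
systemctl show ollama --property=Environment | tr ' ' '\n' | grep OLLAMA_CONTEXT_LENGTH    # (2)
ollama ps                              # (3) anything else still resident?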
llama runner process has terminated
ROCm SIGSEGV under concurrent VRAM pressure. Avoid running ollama pull while a large model is keep-alive resident.
FailoverError: LLM request timed out
Default OpenClaw per-provider timeout too short. Raise models.providers.ollama.timeoutSeconds to 600 (Step 8).
Hard power-off during sustained inference, no kernel log
Silicon-level thermal cutoff. GPU pinned to high instead of auto. Switch to auto in llm-tuning.service (Step 6) and reboot.
sqlite-vec unavailable warning from OpenClaw
Known upstream ABI mismatch between better-sqlite3 (SQLite 3.51) and the precompiled sqlite-vec binary (3.45). OpenClaw falls back to in-process cosine similarity. No suppression flag exists. Ignore.
Out of scope for this recipe
- Public exposure — LAN-only configuration. Reverse proxy + auth + rate limiting are a separate procedure.
- Multi-user OpenClaw — single-owner Telegram-bot configuration.
- Vulkan as primary backend — if you want to compare ROCm vs Vulkan, build llama.cpp directly.
- Air-gapped install — possible but requires offline mirroring of Ollama, OpenClaw, and the model.
Why this recipe exists
This is the local-LLM testbed configuration we run for AGLedger, a cryptographic notary for automated work. We needed a reproducible, on-premises, near-frontier-quality LLM environment to test agent accountability flows against.
The full story — failure modes, before/after benchmarks, and the older-guides-vs-26.04 comparison — is in the companion blog post: Near Frontier-Quality LLM, No Cloud, No Subscription, Unlimited Tokens.
If you are running local agents and want a tamper-evident chain of every turn they take, AGLedger Developer Edition is free and fully unlocked; it runs offline and never phones home.
Sources & further reading
- /recipes/local-llm-strix-halo-ubuntu-26-04.md — agent-readable plain markdown of this recipe
- Companion blog post — the story, benchmarks, and full configuration walk-through
- bench.py — stdlib-only Python benchmark script
- Ollama OpenClaw integration docs