
Recipe · verified 2026-05-06

Engineering recipe

gpt-oss:120b on Strix Halo + Ubuntu 26.04

Agent-readable version

Plain markdown at /recipes/local-llm-strix-halo-ubuntu-26-04.md. Fetch it with one curl, ingest it with any LLM tool; licensed CC0.

curl -O https://agledger.ai/recipes/local-llm-strix-halo-ubuntu-26-04.md

What this recipe produces

Prerequisites

Hardware

Software

Software versions verified

Ubuntu 26.04 LTS (Resolute Raccoon)

kernel 7.0.0-15-generic

Mesa / RADV 26.0.3-1ubuntu1

amdgpu DRM driver 3.64.0

Ollama 0.22.1

OpenClaw 2026.5.4

Node.js 24.15.0

Step 1 — BIOS: reduce UMA Frame Buffer to minimum

At boot, enter the BIOS setup utility (typically Delete or F2 on the EVO-X2). Find the UMA Frame Buffer Size setting (under Memory or Advanced/AMD CBS depending on platform).

Set to the BIOS minimum. On the GMKtec EVO-X2, that minimum is 2 GB. Save and reboot.

Verify after boot:

sudo dmesg | grep "of VRAM memory ready"
# Expect: amdgpu 0000:c4:00.0:  2048M of VRAM memory ready

On a unified-memory APU, BIOS-reserved VRAM is invisible to Linux as system RAM. The 32 GB default therefore walls off 32 GB you cannot use for the model. Reducing it to 2 GB returns ~30 GB to the system, which the GPU can still reach dynamically through the GTT pool.
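A quick cross-check that the reclaimed carve-out is actually visible to Linux again (exact figures depend on your RAM fit and other firmware reservations):

free -h
grep MemTotal /proc/meminfo
# MemTotal should read roughly 30 GB higher than it did with the 32 GB default.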

Step 2 — Kernel command line: complete the parameter trio

sudo cp /etc/default/grub /etc/default/grub.bak.$(date +%s)
# Note: the sed below replaces the entire GRUB_CMDLINE_LINUX_DEFAULT value (e.g. "quiet splash"); the backup above keeps the original.
sudo sed -i 's|^GRUB_CMDLINE_LINUX_DEFAULT="[^"]*"|GRUB_CMDLINE_LINUX_DEFAULT="ttm.pages_limit=23068672 amdgpu.no_system_mem_limit=1"|' /etc/default/grub
sudo update-grub
sudo systemctl reboot

23068672 pages × 4 KiB = 88 GiB GPU-allocatable cap. amdgpu.no_system_mem_limit=1 removes a separate AMDGPU-level cap on system memory pinning.
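If you ever need a different cap (say on a smaller-RAM part), the conversion is just target GiB × 262144 pages; a throwaway shell calculation is enough:

# 1 GiB = 262144 four-KiB pages; 88 × 262144 = 23068672, the value used above.
TARGET_GIB=88
echo $(( TARGET_GIB * 262144 ))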

Verify after reboot:

cat /proc/cmdline | tr ' ' '\n' | grep -E 'ttm|amdgpu'
# Expect:
#   ttm.pages_limit=23068672
#   amdgpu.no_system_mem_limit=1

cat /sys/module/ttm/parameters/pages_limit
# Expect: 23068672

sudo dmesg | grep "of GTT memory ready"
# Expect: amdgpu 0000:c4:00.0:  90112M of GTT memory ready.

The kernel auto-detects how much GTT exists, but does NOT auto-bump the allocation cap past the BIOS VRAM slice. Older guides recommend amdgpu.gttsize= instead; that parameter is deprecated and ignored on kernel 7.0.
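If you want to watch the two pools directly while a model loads, amdgpu exposes running totals in sysfs; this is a sketch, and the card index and exact file set can vary by system:

# VRAM = the small BIOS carve-out; GTT = the dynamic pool the model weights actually land in.
for f in /sys/class/drm/card*/device/mem_info_{vram,gtt}_{total,used}; do
  printf '%-70s %6d MiB\n' "$f" "$(( $(cat "$f") / 1048576 ))"
done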

Step 3 — Install Ollama

curl -fsSL https://ollama.com/install.sh | sudo sh

The installer detects the AMD GPU via lspci, pulls the ROCm-tagged tarball with bundled gfx1151 kernels, adds your user to the ollama group, and starts a systemd service.

Verify:

ollama --version
# Expect: 0.22.1 or newer

systemctl is-active ollama
# Expect: active

sudo journalctl -u ollama --since '1 minute ago' | grep "library=ROCm"
# Expect: ...library=ROCm compute=gfx1151 ... type=iGPU

Ollama 0.22.1+ ships its own ROCm runtime. You do NOT need a separate ROCm install, do NOT need HSA_OVERRIDE_GFX_VERSION, and do NOT need to force Vulkan.
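If you are migrating from an older setup, it is worth confirming none of those legacy overrides are still being injected into the service. A crude grep for the usual suspects is enough; both checks should come back clean at this point (Step 4 adds the intended variables):

systemctl show ollama --property=Environment
grep -rs 'HSA_OVERRIDE\|VULKAN' /etc/systemd/system/ollama.service.d/ || echo "no legacy overrides found"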

Step 4 — Configure Ollama for LAN access and large context

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_CONTEXT_LENGTH=131072"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama

Always pin OLLAMA_CONTEXT_LENGTH explicitly. The auto-context picker over-shoots on this chip (62 GiB available → defaults to 262144 even for models with n_ctx_train=131072). Flash Attention + Q8 KV cache cuts per-token KV cost to ~22 KB, making 128K context comfortable.
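After the restart, confirm the override landed. The KV arithmetic above also gives you the budget to expect on top of the ~65 GB of weights: 131072 tokens × ~22 KB ≈ 2.75 GiB.

systemctl show ollama --property=Environment | tr ' ' '\n' | grep OLLAMA
# Expect all five values from override.conf, including OLLAMA_CONTEXT_LENGTH=131072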

Step 5 — Pull gpt-oss:120b

ollama pull gpt-oss:120b
# ~65 GB MXFP4 download

Do NOT pull while another large model is keep-alive resident. Concurrent VRAM pressure plus heavy disk I/O has produced ROCm SIGSEGVs on this chip. Symptom: llama runner process has terminated: %!w(<nil>).
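A safe pre-pull habit is to check for and evict anything still resident (ollama stop unloads a model immediately instead of waiting out the keep-alive):

ollama ps                # anything listed here is still resident
ollama stop MODEL_NAME   # replace MODEL_NAME with whatever ollama ps listed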

Step 6 — Persistent sysctl + sysfs tuning

echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/99-llm-tuning.conf
sudo sysctl -p /etc/sysctl.d/99-llm-tuning.conf

sudo tee /etc/systemd/system/llm-tuning.service <<'EOF'
[Unit]
Description=LLM tuning: transparent huge pages + AMD GPU power level
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo always > /sys/kernel/mm/transparent_hugepage/enabled"
ExecStart=/bin/sh -c "echo always > /sys/kernel/mm/transparent_hugepage/defrag"
ExecStart=/bin/sh -c "for f in /sys/class/drm/card*/device/power_dpm_force_performance_level; do [ -w \"$f\" ] && echo auto > \"$f\"; done"
RemainAfterExit=true

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llm-tuning.service
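Verify:

cat /sys/kernel/mm/transparent_hugepage/enabled /sys/kernel/mm/transparent_hugepage/defrag
# Expect "always" to be the bracketed (active) choice in both
cat /sys/class/drm/card*/device/power_dpm_force_performance_level
# Expect: auto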

Use auto, not high. Pinning to high removes cooling time between inference bursts and has produced silicon-level thermal cutoffs (hard power-off, no kernel log) on small-form-factor chassis under sustained agent traffic. The kernel's auto governor still ramps the GPU to max during inference; it just relaxes between turns. Same throughput, lower thermal risk.

Step 7 — Smoke test: load and run gpt-oss:120b

ollama run gpt-oss:120b "Reply with exactly two words." --verbose
ollama ps

Expect from ollama ps:

NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:120b    a951a23b46a1    68 GB    100% GPU     131072     29 minutes from now

Critical: PROCESSOR must show 100% GPU.

Any non-100% split (e.g. 11%/89% CPU/GPU) is a configuration failure, not a tradeoff. Partial CPU offload has been measured at 0.27 tok/s vs 35.5 tok/s when fully on GPU — ~130× difference. If you see a split, recheck Steps 1-2.
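For a scriptable version of the same check, a plain grep against the ollama ps table above is enough:

if ollama ps | grep 'gpt-oss:120b' | grep -q '100% GPU'; then
  echo "OK: fully GPU-resident"
else
  echo "WARN: partial CPU offload, recheck Steps 1-2" >&2
fi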

Step 8 — (Optional) Install OpenClaw agent gateway

OpenClaw bridges messaging channels (Telegram, Slack, Discord, iMessage, WhatsApp, Signal, others) to the local Ollama backend.

# Node 24 LTS via NodeSource
curl -fsSL https://deb.nodesource.com/setup_24.x | sudo bash -
sudo apt install -y nodejs

# OpenClaw
sudo npm install -g openclaw@latest
openclaw onboard \
  --non-interactive --accept-risk \
  --auth-choice ollama \
  --install-daemon \
  --gateway-bind loopback --gateway-auth token

openclaw config set agents.defaults.model.primary "ollama/gpt-oss:120b"

# CRITICAL: raise the per-provider request timeout
cat <<'PATCH' | openclaw config patch --stdin
{ "models": { "providers": { "ollama": { "timeoutSeconds": 600 } } } }
PATCH

# Recommended on a memory-tight box: remove fallback prewarming
openclaw config unset agents.defaults.model.fallbacks

# User-scope service across logouts
sudo loginctl enable-linger $USER
systemctl --user enable --now openclaw-gateway

Why timeoutSeconds: 600 is critical: OpenClaw's default per-provider timeout (~140 seconds) is too short for gpt-oss:120b's reasoning phase. Reasoning models stream delta.reasoning tokens with no delta.content until they finish thinking, and the gateway waits for content. Symptom: FailoverError: LLM request timed out. Raising to 600 s eliminates the false failovers.
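For a rough illustration of how easily a ~140-second budget is exceeded, time a cold request through the plain Ollama API (no OpenClaw involved): the ~65 GB of weights have to stream back into GTT before the first token, and the reasoning phase comes on top of that.

ollama stop gpt-oss:120b    # force a cold start
time curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "gpt-oss:120b", "prompt": "hi", "stream": false}' > /dev/null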

Why we remove fallback prewarming: OpenClaw's model-prewarm sidecar loads ALL configured fallbacks at gateway startup, not just on actual failover. On an 88 GB-cap iGPU with a 65 GB primary, prewarming a 12 GB fallback eats most of the headroom the rest of this recipe was tuned to preserve.

Step 9 — End-to-end verification

# Direct Ollama probe (model warm) — should return in ~1.2 s
curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "gpt-oss:120b",
  "prompt": "Reply with the single word: READY",
  "stream": false
}' | grep -oE '"response":"[^"]*"'

# OpenClaw gateway probe (warm) — should return under 4 s
time openclaw infer model run --gateway --model ollama/gpt-oss:120b \
    --prompt "Reply with the single word: READY"

# Confirm 100% GPU and 128K context
ollama ps

# Optional: install lm-sensors for ongoing thermal monitoring
sudo apt install -y lm-sensors
yes "" | sudo sensors-detect --auto
sensors

Benchmark targets

Run bench.py to compare. Expect within ~10% of these numbers on a properly configured 96 GB box:

Model           Resident    Code          Reasoning     Architecture
gemma4          10 GB       54.3 tok/s    52.0 tok/s    Dense
gpt-oss:20b     13 GB       48.7 tok/s    47.4 tok/s    MoE, reasoning
gpt-oss:120b    65 GB       35.5 tok/s    34.9 tok/s    MoE, ~5B active
llama3.3:70b    57 GB       5.1 tok/s     5.1 tok/s     Dense
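For a rough single-number spot check without bench.py, Ollama's verbose run stats include the decode speed (stderr is merged below in case the stats land there):

ollama run gpt-oss:120b "Summarize the TCP three-way handshake." --verbose 2>&1 | grep 'eval rate'
# Expect an eval rate in the mid-30s tok/s when the model is fully GPU-resident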

Troubleshooting

unable to allocate ROCm0 buffer

TTM cap is set, but amdgpu.no_system_mem_limit=1 is missing OR the BIOS UMA reduction was not applied. Check both. The trio (Steps 1+2) is non-optional for gpt-oss:120b-class loads.

ollama ps shows 11%/89% CPU/GPU

Not enough headroom. In order of likelihood: (1) BIOS UMA Frame Buffer is still at 32 GB default; (2) OLLAMA_CONTEXT_LENGTH too high; (3) another model is keep-alive resident.

llama runner process has terminated

ROCm SIGSEGV under concurrent VRAM pressure. Avoid running ollama pull while a large model is keep-alive resident.

FailoverError: LLM request timed out

Default OpenClaw per-provider timeout too short. Raise models.providers.ollama.timeoutSeconds to 600 (Step 8).

Hard power-off during sustained inference, no kernel log

Silicon-level thermal cutoff. GPU pinned to high instead of auto. Switch to auto in llm-tuning.service (Step 6) and reboot.

sqlite-vec unavailable warning from OpenClaw

Known upstream ABI mismatch between better-sqlite3 (SQLite 3.51) and the precompiled sqlite-vec binary (3.45). OpenClaw falls back to in-process cosine similarity. No suppression flag exists. Ignore.

Out of scope for this recipe

Why this recipe exists

This is the local-LLM testbed configuration we run for AGLedger, a cryptographic notary for automated work. We needed a reproducible, on-premises, near-frontier-quality LLM environment to test agent accountability flows against.

The full story — failure modes, before/after benchmarks, and the older-guides-vs-26.04 comparison — is in the companion blog post: Near Frontier-Quality LLM, No Cloud, No Subscription, Unlimited Tokens.

If you are running local agents and want a tamper-evident chain of every turn they take, AGLedger Developer Edition is free and fully unlocked; it runs offline and never phones home.

Sources & further reading

/recipes/local-llm-strix-halo-ubuntu-26-04.md — agent-readable plain markdown of this recipe

Companion blog post — the story, benchmarks, and full configuration walk-through

bench.py — stdlib-only Python benchmark script

OpenClaw on GitHub

Ollama OpenClaw integration docs

gpt-oss:120b model card

Linux kernel amdgpu driver documentation

Mesa 26.0 release notes (RADV STRIX_HALO)
