Local LLM benchmarks on AMD Strix Halo
Multi-backend benchmark results for gfx1151, comparing RADV, AMDVLK, and ROCm across performance, creative writing, and coding tasks.
Ryzen AI MAX+ 395 • Radeon 8060S • 128GB UMA • RDNA 3.5

Gemma 4 Family New
Google's full Gemma 4 lineup, all tested as Unsloth UD-Q4_K_XL on RADV.
| Model | Arch | Size | pp512 (t/s) | tg128 (t/s) | Writing /30 | Polyglot /65 | Postgres /57 | Cassandra /56 |
|---|---|---|---|---|---|---|---|---|
| Gemma-4-E2B | Dense (4.6B) | 2.9 GiB | 3382 | 109 | 18 | 7 | 27 | 27 |
| Gemma-4-E4B | Dense (7.5B) | 4.7 GiB | 1828 | 59 | 23 | 3 | 26 | 26 |
| Gemma-4-26B-A4B | MoE (4B active) | 16 GiB | 1196 | 52.9 | 23 | 10 | 44 | 38 |
| Gemma-4-31B | Dense | 17.5 GiB | 247 | 11.1 | 26 | 14 | 46 | 39 |
The E2B is the fastest model I've tested on this hardware. At 109 tokens/second and under 3 GiB, it scores a perfect 59/59 on LeetCode hard problems. The E4B matches the much larger 26B-A4B MoE on creative writing (23/30) while fitting in 4.7 GiB. The 26B-A4B MoE remains the best balance of speed and quality, and the dense 31B still writes the best prose of the group at 26/30. On Cassandra, Gemma 4 leads the entire field: the dense 31B tops the leaderboard at 39/56 with 9/10 on CQL query writing, and the E4B also scores 9/10 on T1, matching the 31B despite being a 7.5B model. The 26B MoE gets a perfect 10/10 on anti-pattern detection, the only model to do so across all 14 models tested.
Qwen 3.5 Family
Alibaba's MoE and dense lineup, tested with Unsloth-recommended sampling params (temp 0.6, top_p 0.95, thinking disabled).
| Model | Arch | Size | pp512 (t/s) | tg128 (t/s) | Writing /30 | Polyglot /65 | Postgres /57 | Cassandra /56 |
|---|---|---|---|---|---|---|---|---|
| Qwen3.5-4B | Dense | 4 GiB | 1375 | 38 | 16 | 3 | 17 | 16 |
| Qwen3.5-9B | Dense | 5.6 GiB | 972 | 36 | 20 | 5 | 28 | 22 |
| Qwen3.5-27B | Dense | 16 GiB | 310 | 12 | 25 | 10 | 34 | 29 |
| Qwen3.5-35B-A3B | MoE (3B active) | 21 GiB | 1017 | 60 | 28 | 8 | 32 | 33 |
| Qwen3.5-122B-A10B | MoE (10B active) | 72 GiB | 129 | 21 | 29 | 7 | 33 | 33 |
The 35B-A3B MoE is the standout for speed and creative writing. At 60 t/s it runs 5x faster than the dense 27B while scoring higher on writing (28/30 vs 25/30). All models require Unsloth's recommended sampling params or they produce garbage. The dense 27B leads the family on PostgreSQL (34/57 vs 32/57 for the MoE), driven by T1 complex SQL where the MoE's 3B active params fall short. On Cassandra, the 35B MoE and 122B tie at 33/56. The MoE gets there on T1 CQL writing (8/10) and T4 procedural (13/30), while the 122B compensates with stronger T2 anti-pattern detection (8/10 vs 6/10). Both trail Gemma's 39/56 at the top. The 9B is the only model in either family to solve PostgreSQL recursive cycle detection. RLS defeats everyone.
MiniMax-M2.5 New
I tested this model back in February with an IQ3_XXS quant and llama.cpp defaults. It scored 0/10 on coding (missing imports), 24/30 on writing, and I shelved it. Unsloth's Dynamic 2.0 UD-Q3_K_XL quant and their recommended sampling params (temp 1.0, top_p 0.95, min_p 0.01, top_k 40) changed everything. Coding went to 10/10, LeetCode to 59/59, writing to 26/30. The previous 0/10 was a quant quality and sampling configuration problem, not a model capability problem.
At 229B total and roughly 10B active parameters, it eats 94 GiB. The largest model on disk by a wide margin. AMDVLK is the right backend here, giving 32 t/s generation vs 22 on RADV. That 32 t/s translates to about 8 words/second on writing tasks, which is usable but not fast. PostgreSQL landed at 40/57 with perfect 10/10 on T2 query optimization and 19/31 on T4 procedural. Cassandra scored 30/56, also with perfect 10/10 on T2 anti-pattern detection. Only the second model to ace Cassandra T2 after Gemma-4-26B-A4B. Combined 185/285 puts it 6th overall.
The model has a quirk that cost me several hours to track down. Its chat template hardcodes a <think> token in the generation prompt with no toggle to disable it. The --reasoning-budget 0 flag has no effect. I built a custom Jinja template that strips the thinking prefix, but it turns out the model is equally verbose without it. For the polyglot gauntlet I had to bump context from 32K to 64K because the thinking block was eating half the token budget and truncating code output. That got the polyglot score from 6/65 to 13/65, mostly by rescuing the bash log analyzer (0/8 to 6/8). The Ruby and Go challenges still truncate even at 64K.
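The template workaround reduces to stripping the forced thinking block in post-processing. A minimal sketch of that idea (the <think> tag comes from the model's template; the regex handling is my own illustration, not MiniMax or llama.cpp tooling):

```python
import re

# Matches a complete <think>...</think> block, including multiline content.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove the model's <think>...</think> block from its output.
    If the block was opened but never closed (context truncation),
    nothing usable follows it, so return an empty string."""
    cleaned = THINK_RE.sub("", text, count=1)
    if cleaned.lstrip().startswith("<think>"):
        return ""  # truncated mid-reasoning: no answer was emitted
    return cleaned.strip()
```

The truncated-block case is exactly what the 32K context runs hit: the thinking prefix consumed the budget and the code answer never arrived.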
PostgreSQL Benchmark New
I added a database benchmark that tests real PostgreSQL skills against a live Postgres 18 instance. A fresh container spins up per run with 2.7 million rows of seeded e-commerce data. 28 challenges across four tiers, 57 test points total. Tier 1 tests complex SQL writing (window functions, recursive CTEs, LATERAL JOINs, JSONB aggregation). Tier 2 tests query optimization where the model receives a slow query and its EXPLAIN plan, then has to produce a fix that actually reduces cost against the live data. Tier 3 tests DBA diagnosis from simulated pg_stat_activity and EXPLAIN output. Tier 4 is the hard one. Models write PL/pgSQL functions, triggers, exclusion constraints, RLS policies, and batch procedures with savepoint error handling. Their code gets executed and validated through behavioral test cases.
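The harness itself isn't shown here, but T2 grading can be sketched as plan-cost comparison. A hypothetical grader, assuming EXPLAIN output is captured as text and using an illustrative 50% cost-reduction threshold (the benchmark's real pass criteria may differ):

```python
import re

# Matches the upper bound of a Postgres EXPLAIN cost field, e.g.
# "Seq Scan on orders  (cost=0.00..25873.00 rows=1000000 width=8)"
COST_RE = re.compile(r"cost=\d+\.\d+\.\.(\d+\.\d+)")

def total_cost(plan_text: str) -> float:
    """Total estimated cost = the first cost=x..y upper bound,
    which belongs to the plan's root node."""
    m = COST_RE.search(plan_text)
    if m is None:
        raise ValueError("no cost found in plan")
    return float(m.group(1))

def fix_reduces_cost(before_plan: str, after_plan: str,
                     min_ratio: float = 0.5) -> bool:
    """Pass only if the model's fix cut the estimated cost by at
    least min_ratio when re-planned against the live data."""
    return total_cost(after_plan) <= total_cost(before_plan) * min_ratio
```

Grading against re-planned cost on live data, rather than string-matching the suggested index, is what lets even small models pass T2 when their fix genuinely works.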
I ran the full Gemma 4 family through it. The dense 31B leads at 46/57, followed closely by the 26B-A4B MoE at 44/57. The E2B and E4B small models land at 27 and 26 respectively. Query optimization (T2) turned out to be the great equalizer. Even the 2B model scores 8/10, correctly suggesting GIN indexes, partial indexes, and covering indexes. Tier 4 is where model size actually matters. The 31B and 26B both score 22/31, acing exclusion constraints, audit triggers, partition management, and batch savepoints. The small models collapse to 7-10/31 on the same challenges. Row-level security and recursive cycle detection defeated the entire family at 0/5 and 0/4.
Cassandra Benchmark New
I built a Cassandra benchmark to test whether models that handle PostgreSQL can also reason about distributed databases. A 3-node Cassandra 5.0 cluster spins up on Strix per run, seeded with 1.5 million rows of IoT sensor data across 8 tables. 32 challenges across four tiers, 56 test points total. The fundamental question is whether the model understands that Cassandra is not SQL with different syntax. Every partition key must be in the WHERE clause. There are no JOINs. ORDER BY only works on clustering columns. Deletes create tombstones. Consistency is per-operation. A model trained on relational patterns will fail even if it knows CQL keywords.
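Those rules are mechanical enough to sketch. A toy lint (my own illustration, not the benchmark's grader) that flags the most basic violation: a partition key missing from the WHERE clause, or ALLOW FILTERING papering over it:

```python
import re

def violates_partition_rule(query: str, partition_keys: set[str]) -> bool:
    """Toy check for the most basic Cassandra query rule: a SELECT
    must restrict every partition key column with equality in its
    WHERE clause, and ALLOW FILTERING counts as a failure (legal
    CQL, but a full-cluster scan smell)."""
    q = query.lower()
    if "allow filtering" in q:
        return True
    parts = q.split(" where ", 1)
    if len(parts) < 2:
        return True  # unrestricted scan across all partitions
    restricted = set(re.findall(r"(\w+)\s*=", parts[1]))
    return not partition_keys <= restricted
```

A model that has only internalized relational habits writes queries this lint rejects, regardless of how fluent its CQL syntax looks.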
Tier 2 is the most interesting tier and the sharpest departure from PostgreSQL. Cassandra has no EXPLAIN, so there is no query optimizer to reason about. Instead, the model receives a broken schema with symptoms and has to identify the anti-pattern: unbounded partition growth, hot partitions, secondary index scatter-gather, tombstone accumulation from queue patterns, ALLOW FILTERING abuse, compaction strategy mismatches, consistency level math, batch misuse, materialized view pitfalls, and collection size limits. Gemma-4-E2B scored 7/10 here. That's a 2B model explaining why a logged multi-partition batch is slower than individual async inserts, or why a secondary index on email causes fan-out to all 12 nodes. These concepts require distributed systems understanding, not coding ability.
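The unbounded-growth anti-pattern reduces to arithmetic a model either does or doesn't do. A sketch with an assumed 1 Hz per-sensor write rate (illustrative numbers, not the benchmark's actual challenge data):

```python
SECONDS_PER_DAY = 86_400

def partition_rows(writes_per_sec: float, bucket_days: float) -> int:
    """Rows accumulated in a single partition: write rate times the
    time-bucket width covered by the partition key."""
    return int(writes_per_sec * bucket_days * SECONDS_PER_DAY)

# With partition key (sensor_id) alone, the bucket is effectively
# unbounded: a 1 Hz sensor piles up ~86k rows/day, ~31.5M rows/year
# in one partition. Re-keying to (sensor_id, day) caps each
# partition at one day's worth of readings.
```

Spotting that the fix is a time bucket in the partition key, rather than an index or a faster disk, is the distributed-systems reasoning the tier is probing.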
Tier 4 executes model code against the live cluster. Time-bucketed schema design with TWCS compaction, distributed locks via lightweight transactions, materialized views, counter-based rate limiters, batch denormalization, and Storage-Attached Indexes. The 26B-A4B and 31B both score 14-15/30, acing LWT locks and counters but failing materialized views and batch denormalization. The E2B actually outperforms the E4B here (10 vs 4), perfectly solving both the LWT lock and counter challenges while the 4B model can't produce valid CQL for any of them.
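The LWT lock pattern maps onto INSERT ... IF NOT EXISTS USING TTL plus a conditional DELETE. An in-memory Python stand-in for those semantics (my own sketch, not the benchmark's validator; the TTL expiry and ownership check are the behaviors the tier tests):

```python
import time

class LwtLockTable:
    """In-memory stand-in for a Cassandra lock table keyed by
    resource name, mimicking lightweight-transaction semantics."""

    def __init__(self) -> None:
        self._locks: dict[str, tuple[str, float]] = {}

    def acquire(self, resource: str, owner: str, ttl: float) -> bool:
        """INSERT ... IF NOT EXISTS USING TTL: succeeds only if no
        unexpired row exists. Returns the [applied] flag."""
        now = time.monotonic()
        held = self._locks.get(resource)
        if held and held[1] > now:
            return False  # [applied] = false: lock is held
        self._locks[resource] = (owner, now + ttl)
        return True

    def release(self, resource: str, owner: str) -> bool:
        """DELETE ... IF owner = ?: only the holder may release."""
        held = self._locks.get(resource)
        if held and held[0] == owner:
            del self._locks[resource]
            return True
        return False
```

The common failure mode in model answers is a plain INSERT/DELETE without the IF conditions, which races under concurrency and scores zero in behavioral validation.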
T1 CQL writing scores range from 2/10 (Qwen3.5-4B) to 9/10 (Gemma-4-31B and E4B). The E4B result stands out: a 7.5B dense model matching the 31B on CQL syntax while most MoE models with more total parameters score 7-8/10. TOKEN() range scans, GROUP BY semantics, and PER PARTITION LIMIT with multi-column clustering keys are the remaining unsolved patterns. The dense 31B leads the overall leaderboard at 39/56, edging the 26B MoE (38/56) by one point, mirroring its narrow lead on PostgreSQL (46/57 vs 44/57).
Polyglot Coding Gauntlet
The LRU cache benchmark ran its course. 15 models score a perfect 10/10, making it useless for differentiation. I built a new polyglot coding gauntlet with seven challenges across Python, Bash, Ruby/Sinatra, and Go, totalling 65 auto-graded tests at LeetCode medium/hard difficulty. The challenges test real-world skills like parsing structured logs with nested JSON, matching cron expressions, building awk pipelines, implementing FastAPI rate-limiting middleware, writing HMAC-verified Sinatra webhook handlers, and building recursive Go rule engines.
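For a sense of the difficulty, the cron-matching challenge hinges on field parsing of this shape. A minimal single-field matcher, my own illustration rather than the gauntlet's reference solution:

```python
def field_matches(field: str, value: int,
                  field_range: tuple[int, int] = (0, 59)) -> bool:
    """Match one cron field against a value. Handles '*', steps
    ('*/5'), ranges ('1-5'), lists ('1,15'), and combinations
    ('1-10/2,30'). field_range is the field's legal span (0-59 for
    minutes; a full matcher varies it per field)."""
    for part in field.split(","):
        expr, _, step = part.partition("/")
        step = int(step) if step else 1
        if expr == "*":
            lo, hi = field_range
        elif "-" in expr:
            lo, hi = map(int, expr.split("-"))
        else:
            lo = hi = int(expr)
        if lo <= value <= hi and (value - lo) % step == 0:
            return True
    return False
```

Handling the list/range/step combinations consistently across five fields is exactly the kind of edge-case discipline where every model fell over.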
I ran every model on disk through the gauntlet. Nobody breaks 22%. A three-way tie leads at 14/65 (GPT-OSS-120B, GPT-OSS-20B, and Gemma-4-31B), all winning on bash pipelines. The 20B matches its 120B sibling point for point in 3 minutes vs 8, proving the extra 100B parameters add zero value here. Qwen3-30B is right behind at 13/65 and is the only model to score on the Go rule engine (3/10), finishing in 90 seconds. Three challenges remain completely unsolved across all 14 model runs: the cron matcher, the FastAPI rate limiter, and the Sinatra webhook processor all sit at 0/10.
Two clear capability profiles emerged. The Qwen3 family (30B, Coder-Next, 27B dense) excels at Python regex, scoring 9-10/10 on the structured log parser, but completely fails bash extraction. GPT-OSS, Gemma, and the Qwen3.5 MoE family handle bash pipelines well but are weak on regex. Dense models outperform their MoE siblings on structured output. Gemma-4-31B (dense, 14/65) beats the 26B-A4B MoE (10/65) despite being 5x slower. Thinking mode actively hurts. Qwen3.5-9B jumped from 1 to 5 points when thinking was disabled. Model size doesn't correlate with score. The 30B outperforms the 122B, and Nemotron-Cascade-2's IMO gold medal reasoning scored just 1/65.
I also tested whether quantization precision matters by running Gemma-4-26B-A4B at both Q4_K_XL and Q8_0. It doesn't. Run-to-run variance exceeds the quant quality difference when you only have 3.8B active parameters. The benchmark measures model capability ceilings, not quant artifacts.
Hardware
| Component | Spec |
|---|---|
| CPU | AMD Ryzen AI MAX+ 395 (16C/32T) |
| GPU | Radeon 8060S Graphics (RDNA 3.5, gfx1151) |
| Memory | 128GB unified (120GB soft VRAM cap) |
| Kernel | 6.17.0-8-generic |
| Vulkan | 1.4.321 (Mesa RADV 25.2.3 + AMDVLK 2025.Q2.1) |
| ROCm | TheRock 7.11.0a20260121 nightly |
| llama.cpp | b8638 (2026-04-02) |
Three backends, one GPU
RADV
Mesa's open-source Vulkan driver. Best overall since the b8119 MMQ fix, which brought a 25% pp boost on Qwen3-30B. Recommended default.
AMDVLK
AMD's open-source Vulkan driver. Best for GPT-OSS-120B prompt processing. Good stability for production use.
ROCm
AMD's compute stack via TheRock nightlies. Requires -mmp 0 (mmap disabled). Best for some dense models (GLM-4.7, Qwen3.5-27B).