Local LLM benchmarks on AMD Strix Halo

Multi-backend benchmark results for gfx1151, comparing RADV, AMDVLK, and ROCm across performance, creative writing, and coding tasks.

Ryzen AI MAX+ 395 • Radeon 8060S • 128GB UMA • RDNA 3.5

b9553

llama.cpp build

Backends tested

32+

Models tested

Benchmark suites

Ornith-1.0-35B New

Ornith-1.0-35B is the newest result on the board and the most interesting one in a while. It is DeepReinforce's reinforcement-learning fine-tune of Qwen3.5-35B-A3B aimed squarely at agentic coding, released late June 2026 under an MIT license. At Q8_0 it lands Combined 227/285, second overall behind Gemma-4-31B at 238, and it is an always-on reasoning model so I ran the whole suite thinking-on. What makes it worth a spotlight is the comparison built into its own family. I have the base Qwen3.5-35B-A3B on record at 192 and a different agentic fine-tune, Nex-N2-mini, at 183. Ornith clears both by a wide margin, and because I ran it the same thinking-on way I ran Nex, the 44-point gap between them is a clean measure of what good reinforcement training does to a base.

Model	Writing /30	LeetCode /59	Polyglot /65	Postgres /57	Cassandra /56	Halluc. /22	Combined /285
Qwen3.5-35B-A3B (base)	28	59	17	32	38	21	192
Nex-N2-mini (agentic)	28	53	17	38	29	12	183
Ornith-1.0-35B New	28	59	48	46	28	22	227

The whole story is in the coding columns. Polyglot jumps from 17 on both the base and the Nex tune to 48 here, the best local score on the board and four clear of Gemma-4-31B. Ornith is the first local model to solve the cron matcher at a full 10/10 and the first to handle the sliding-window rate limiter at 9/10, two challenges that had defeated every model before it. Postgres climbs to 46/57 on perfect query optimization and 25 of 31 procedural points. The gains do not reach everywhere. Cassandra slips to 28/56 with Tier 4 procedural CQL collapsing to 7 of 30, the same dialect-specific syntax errors the Nex tune made, so the reinforcement training that taught it Python and Go did nothing for Cassandra DDL. It also returned a perfect 22/22 on the hallucination calibration set, matching the base rather than beating it, and a board-topping 77/80 on the prose constraint test where it held first-person present tense across the whole chapter without a single stray colon or dash. It also took the top raw score on the automation reliability bench at 61/65, though thinking-on pushed it to 43 seconds per call, so it wins the point total without being the model I would wire into a real pipeline.

Agentic coding New

Every other benchmark here is single-shot. I wanted to know something a single prompt can't tell me, whether a model can actually drive a coding agent through a long chain of tool calls. So I had pi build a terminal Sokoban game phase by phase, five accreting phases, scored two ways, across six models including gpt-5.5 as the frontier line. The result surprised me. The hard algorithm I built as the wall, a Sokoban solver, was the easy part. Every model wrote a working search. What separated them was keeping their own growing code coherent across the whole build. The first pass put the local Gemma-4-31B ahead of gpt-5.5, so I ran the top three models three times each, and the ranking came apart. The most consistent agent was the local Qwen3.6-27B, clean in every trial, while the 31B's flawless run turned out to be one draw of three. The full writeup covers the harness, the multi-tool-calling setup, MTP under an agentic load, and why a single agentic run can't rank the top models.

Gemma 4 MTP New

Multi-Token Prediction landed for Gemma 4 in mainline llama.cpp on June 7 (PR #23398), and it works differently from the Qwen3.6 version I covered earlier. Qwen ships the draft head inside the main GGUF. Gemma 4 uses a separate assistant drafter file of about 0.45 GiB that you load alongside the model with --model-draft assistant.gguf --spec-type draft-mtp. The drafter shares the main model's KV cache. I added support for it to my launcher, downloaded drafters for all five Gemma 4 variants on disk, and ran the numbers.

The question that matters for speculative decoding is whether it costs anything. To answer it cleanly I ran the full tracked suite twice on the same build 9553, once against the plain 31B and once with the drafter at n=4, so the only variable between the two runs is MTP itself. The plain run reproduced my earlier tooling score and per-call latency almost exactly, which rules out build drift. The whole suite finished in 36 minutes with the drafter against 72 minutes without it, and the scores did not move.

Benchmark	Plain	MTP	Quality shift	Speed
FastAPI	8/8	8/8	0	3.19x
LeetCode	59/59	59/59	0	2.93x
Polyglot	54/65	49/65	-5 (noise)	2.50x
PostgreSQL	49/57	47/57	-2 (noise)	2.18x
Coding (LRU)	10/10	10/10	0	2.13x
Tooling	54.5/65	53.5/65	-1 (noise)	2.07x
Cassandra	38/56	38/56	0	1.63x
Creative writing	~27/30	~27/30	0	1.55x

The scores hold because MTP cannot change them, and I read the acceptance code to be sure. Every token that gets emitted is sampled from the main model's own distribution with its full temperature and top-p and top-k chain. The drafter only proposes candidates, and a proposal is kept only when it matches what the main model would have drawn anyway. On any mismatch the loop stops and uses the main model's token. The output is the same distribution you get without a drafter, so a run with MTP and a run without it differ only in their random draws, the way two seeds differ. The three deterministic benchmarks come back identical and tooling reliability matches to the decimal. PostgreSQL and Polyglot are the two highest-variance partial-credit benchmarks on the board, and Polyglot alone has swung 19 to 35 across repeat runs of one model, so a single-run drop of a few points there is its own noise rather than anything the drafter did.

The speedup depends entirely on what the model is generating. Structured output with predictable continuations wins biggest, with FastAPI at 3.19x and LeetCode at 2.93x, because code idioms and signatures are easy to draft and most drafts get accepted. Creative writing at temperature 1.0 wins least at 1.55x, because high-entropy prose breaks almost every draft. Tooling sits in the middle at 2.07x for a different reason. Those calls feed long articles in and ask for short summaries out, and MTP only accelerates generation, so the prompt-processing half of each call sees no benefit. The headline is workload-shaped rather than a single number.

I swept all five Gemma 4 variants to find the right draft depth per model.

Model	Baseline tok/s	Best tok/s	Speedup	Best n
Gemma-4-12B (dense)	13.74	33.49	2.44x	4
Gemma-4-31B (dense)	10.35	24.50	2.37x	4
Gemma-4-31B QAT (dense)	10.95	25.08	2.29x	4
Gemma-4-26B-A4B (MoE)	42.33	59.36	1.40x	2
Gemma-4-26B-A4B QAT (MoE)	61.22	83.18	1.36x	2

The dense models gain 2.3 to 2.4x and want a draft window of n=4. The MoE pairs gain about 1.4x and peak at n=2, declining with deeper drafts. That MoE gain stands out because the PR author and several CUDA users reported no MoE speedup at all. On gfx1151 the decode step is memory-bandwidth-bound, and the drafter's extra forward pass is cheap against the wait on weight bandwidth, so even the compute-light MoE has slack for MTP to fill. The QAT drafters accept a few points lower than the others because they were trained against unquantized weights, though it does not dent the speedup. Every Gemma 4 model on Strix now runs with its drafter on by default, the dense ones at n=4 and the MoE ones at n=2.

Google pushed a Gemma 4 chat-template fix the morning after these runs. I re-ran tooling and creative writing with that template overriding the one baked into the GGUF, and the tooling result came back byte for byte identical at 53.5/65 and 13.23 seconds per call. The fix only touches tool-call handling and multi-turn continuation, neither of which a single-turn run exercises, so the numbers above stand. It will matter once I add a tool-use or multi-turn benchmark.

Tooling benchmark New

I added a tenth suite that answers a question the other nine do not. Most of my actual local model usage is unattended automation, scripts that summarize articles into strict JSON, sort transactions into fixed categories, and pull dates and amounts out of messy text. Those pipelines call a model dozens of times per run, parse the output with code, never retry, and silently drop anything that fails to parse. For that kind of work, peak quality matters less than whether call 47 of 65 comes back machine-readable.

The tooling benchmark runs 19 challenges as 65 separate calls across four tiers. Strict JSON output against real pipeline prompts, extraction and classification with exact ground truth, plain-text format contracts, and a robustness tier covering prompt injection, deliberately altered facts, and refusals on harmless input. Every challenge runs multiple trials at temperature 0.7, and the headline numbers are reliability rates and per-call latency rather than a single best answer.

Eight runs are on the board now. Ornith-1.0-35B holds the highest raw total at 61/65, but it ran thinking-on at 43 seconds per call, roughly eight times slower than anything else here, with strict parsing down at 87.7 percent, so the point total oversells it for unattended work. Among the models you would actually wire into a pipeline, Qwen3.6-27B leads at 60.5/65 with every one of its 65 calls machine-usable and 96.9 percent parsing with zero recovery. Google's day-zero Gemma-4-31B QAT Q4_0 sits second at 59.5, beating the Unsloth quant of the same model by five points at the same file size and posting the best strict parse rate measured so far at 98.5 percent. Gemma-4-26B-A4B at Q8_0 takes third at 59, and its QAT sibling trades five points for half the footprint and the fastest pace on the bench at 4.7 seconds per call. The recurring point sink across every model is the same, summaries that run just short of a stated word minimum while still parsing cleanly, which is why the reliability column and the score column disagree.

The robustness tier keeps producing the best findings. The 26B-A4B at Q8_0 fell for an embedded injection outright, returning a summary field containing exactly the payload a fake system message inside the article demanded while writing correct key points about the real story in the same response. The 35B-A3B bent a different way, keeping its summary clean but assigning exactly the importance score the injection asked for. Both Qwen models imported the word diesel into a ferry story whose source never mentions fuel, and no Gemma did. Full details on the methodology page, reference-quant scores on the models page, and the QAT comparison on the quantization page.

Gemma-4-12B

Gemma-4-12B was the newest result on the board until Ornith arrived. Released 2026-06-03 and run at Unsloth UD-Q8_K_XL, it lands Combined 222/285, third overall behind Gemma-4-31B at 238 and Ornith-1.0-35B at 227. It uses 13.6 GiB and posts the strongest score per parameter I have measured on this hardware.

Model	Arch	Size	pp512 (t/s)	tg128 (t/s)	Writing /30	Polyglot /65	Postgres /57	Cassandra /56
Gemma-4-E2B	Dense (4.6B)	2.9 GiB	3382	109	18	10	27	27
Gemma-4-E4B	Dense (7.5B)	4.7 GiB	1828	59	23	13	35	34
Gemma-4-26B-A4B	MoE (4B active)	16 GiB	1196	52.9	23	22	49	39
Gemma-4-12B New	Dense (12B)	13.6 GiB	716	14.0	28	32	47	38
Gemma-4-31B	Dense	17.5 GiB	261	11.1	27	44	51	39

The 12B result changes the Gemma lineup. It writes 28/30, one point above the 31B, with all five required beats, distinct character voices, and a clean opening. Postgres is 47/57, third best on the board, with perfect diagnosis and 26/31 on procedural tasks. Cassandra is 38/56, tied with the 26B-A4B and one behind the 31B. Polyglot is 32/65 with a perfect Go rule engine and a partial cron solve at 6/10.

Speed is the only weak point. At Q8 a dense 12B is memory-bandwidth-bound here, so generation stays around 14 to 15 tokens per second across all three backends. ROCm leads prompt processing at 928 against RADV's 716, but generation barely moves. A Q4 build would roughly double throughput at some quality cost, so Q8 stays the quality reference until a lower-precision run proves itself.

The rest of the family still matters. The E2B is the fastest model I have tested at 109 tokens/second and under 3 GiB. The 26B-A4B MoE leads Gemma on Cassandra tie-breaks and posts 49/57 on PostgreSQL. The dense 31B still owns the top combined score at 238/285, but the new 12B is close enough to be the practical default for quality testing.

Mainline MTP for Qwen3.6

am17an's Multi-Token Prediction PR (#22673) merged into mainline llama.cpp on May 16, build 9191. Mainline now ships MTP support for the Qwen3.5 and Qwen3.6 families. The Gemma-4 MTP work (PR #22738) was closed without merging and is not coming back, so Qwen is the only family this matters for today. I pulled the new build, downloaded the Unsloth MTP-enabled GGUFs for both the 27B dense and the 35B-A3B MoE, and ran a baseline pass plus draft-mtp at n=2 and n=3 across the same nine-prompt suite the PR author used for the published numbers.

MTP loads the draft head from the same single GGUF file. There is no separate drafter to manage. Activate it with --spec-type draft-mtp --spec-draft-n-max N at server start. The model must be an MTP variant. The regular Unsloth GGUF loads fine but the spec flag silently does nothing because the nextn tensors are not present.

Model	Config	tok/s	Speedup	Draft accept
Qwen3.6-27B (dense)	baseline	11.58	1.00x	n/a
Qwen3.6-27B (dense)	MTP n=2	20.37	1.76x	79.7%
Qwen3.6-27B (dense)	MTP n=3	21.32	1.84x	72.3%
Qwen3.6-35B-A3B (MoE)	baseline	54.85	1.00x	n/a
Qwen3.6-35B-A3B (MoE)	MTP n=2	65.85	1.20x	77.7%
Qwen3.6-35B-A3B (MoE)	MTP n=3	66.95	1.22x	71.4%

The dense 27B is the story here. 1.84x token generation at n=3 with zero quality cost, because MTP is lossless draft-and-verify rejection sampling at the target distribution. The server-side baseline of 11.58 tok/s matches the 12.03 llama-bench reference within chat-template overhead. The win on the dense model comes from being memory-bandwidth-bound during decode. The MTP head's extra forward pass is cheap relative to the wait on weight bandwidth, and most drafts get accepted, so the saved verifies are nearly free wall time.

The MoE 35B-A3B gains only 1.22x. With 3B active parameters decode is already compute-light, MTP overhead eats most of the would-be savings, and the absolute baseline of 54.85 tok/s leaves less room to grow. The acceptance rate is still healthy at 71 to 78 percent. The mechanism works fine, the headroom is just smaller. If memory is tight the regular Unsloth GGUF (without the extra 0.42B MTP layer) is the cleaner choice. If the model is loaded anyway, leave MTP on.

Acceptance rate tracks prompt determinism, not model. Both models accept above 80 percent on summarize, code_python, and translation. Both drop below 55 percent on creative_short, which is a four-line poem about a lighthouse. High-entropy generation kills draft acceptance because every plausible continuation breaks the draft. The 27B holds 1.60x speedup even at 54 percent acceptance because the dense model has so much bandwidth slack to absorb the failed drafts. The MoE collapses to 1.05x on the same prompt. Per-prompt best at n=3 on the 27B is summarize at 25.58 tok/s (2.13x), on the 35B-A3B is summarize at 87.79 tok/s (1.48x).

n=3 beats n=2 in aggregate for both models. The wider draft window catches more tokens per accepted cycle, which more than compensates for the lower per-draft acceptance rate. Prompt processing was not measurably degraded at the standard -b 2048 server config across this nine-prompt suite. The PR warns of a D2H embedding-transfer penalty on PP, but the prompts are short enough that PP is not the bottleneck. ROCm with MTP is the open question. The ROCm build was last refreshed before the MTP merge, so a rebuild and rerun is the next step there.

Setup is llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf -ngl 999 -b 2048 -t 32 -fa 1 -c 8192 --spec-type draft-mtp --spec-draft-n-max 3. RADV throughout, mainline llama.cpp at 4f13cb742 (build 9191). For the 27B this becomes my default writing config going forward. For the 35B-A3B it stays on by default since the model is loaded anyway, but the dense 27B with MTP is now the throughput-quality sweet spot on Strix.

Real-world workloads on Strix Halo

The 1.84x spec-bench number is the peak case across short prompts averaging 192 tokens. I ran two benchmarks where the gap actually shows up for users (long-form writing and the polyglot coding suite) to check how MTP holds up under realistic Strix Halo workloads with proper Qwen3.6 sampling.

Workload	Wall	tok/s	Quality
Creative writing (2K-word story)	186s	16.4 (1.37x baseline)	~28/30, all five required beats
Polyglot (7 challenges, mixed languages)	254s total	varies per challenge	43/65 (new high, +8 vs prior)

Creative writing is where MTP gives back the least, and that is exactly the workload most Strix Halo users care about. High-entropy prose has the lowest draft acceptance rate because every plausible next token breaks a draft, and the presence_penalty=1.5 that the Qwen3.6 family needs to avoid prose loops further depresses acceptance since the MTP head trained on the unmodified distribution. The 1.37x speedup falls well below the 1.84x peak, but on a 2294-word generation that is the difference between 186 seconds and 254 seconds. Quality held, with all five required beats landed in order and no looping.

Polyglot is where MTP gets out of its own way. Code has predictable continuations like function signatures, common idioms, and well-known boilerplate, so per-challenge draft acceptance lands above 70 percent. The bench scored 43/65 in 4.2 minutes wall time, eight points above the previous Qwen3.6-27B best. T2 Cron solved 10/10, which is unusual since T2 plus T5 Rate Limiter plus T6 Sinatra have been the three universal fails across nearly every model on this bench. The Cron win likely came from the corrected repeat_penalty=1.0 (Qwen3.6 Unsloth default) instead of the prior 1.1 (Qwen3 family default), not from MTP itself. Polyglot has documented single-run variance so 43 carries a one-run asterisk.

For practical Strix Halo use, dense Qwen3.6-27B is the model that benefits most from MTP because gfx1151 decode is memory-bandwidth-bound and the MTP head's extra forward pass is cheap relative to the bandwidth wait. The MoE 35B-A3B sees a smaller 1.22x gain because 3B-active decode is already compute-light and leaves no slack for MTP to absorb. For users whose primary workload is long-form writing or daily-driver coding, the configuration is Qwen3.6-27B + MTP n=3 + the Unsloth Qwen3.5-family sampling profile (presence_penalty=1.5 for writing, repeat_penalty=1.0 across the board). Expect 16 to 21 tok/s depending on workload, with no quality regression.

Mistral-Medium-3.5-128B

Mistral's 128B dense flagship, released March 31. Two upstream blockers cleared in late April. Unsloth uploaded the full GGUF quant set (UD-Q3_K_XL through Q8_0) and the YaRN parser bug that mangled long-context behavior was fixed by setting mscale_all_dim from 1 to 0 in the model config. Architecture is mistral3, the same family as the Mistral-Small-3.1 and 3.2 24B models llama.cpp already supports. Native 256K context, multimodal text and image input through the vision encoder, though I tested only the text path through llama-server. The Unsloth UD-Q4_K_XL split is 75.7 GB across three files and fits comfortably inside the 120 GB soft cap on Strix.

The reasoning interface differs from every other model on this bench. Mistral 3.5 reads a reasoning_effort chat-template kwarg with two settings, "none" and "high". The default "none" produces a direct response. The default enable_thinking: false kwarg the bench passes for Qwen and Gemma is silently ignored here. That detail matters for the LRU result. I ran the standard tracked battery once with default sampling (temp 0.6, top_p 0.95, repeat_penalty 1.0, presence_penalty 0.0) and re-ran the LRU prompt with reasoning_effort: high to get the proper coding score.

Model	Arch	Size	pp512 (t/s)	tg128 (t/s)	Writing /30	Polyglot /65	Postgres /57	Cassandra /56	Combined /285
Mistral-Medium-3.5-128B	Dense (125B)	70.5 GiB	78	3.0	30	30	39	34	207

Combined 207/285 lands the model 6th overall on the leaderboard, behind Gemma-4-31B (238), Ornith-1.0-35B (227), Gemma-4-12B (222), Gemma-4-26B-A4B (215), and Qwen3.6-27B (213), with Qwen3.6-35B-A3B (205) just below. Writing scores 30/30, joint top with Qwen3-30B-Instruct-2507 and Qwen3.6-27B. The story output runs 2082 words against the 2000-word target, reaches all five required beats, and carries a structural payoff that earns the score on craft rather than padding. The hatchling rescue at the heart of the scene parallels the patient grief introduced in the opening, and the model lands the parallel through dialogue (a flat "wasn't meant to" delivered over a struggling baby turtle) rather than narration. Em dash count stays inside budget, sensory layering covers salt, sand texture, light, and skin warmth without any of the AI-prose tells the bench scorecard treats as deductions. Generation cost was 1090.8 seconds (1.9 words per second on ROCm).

The LRU coding score requires explanation. The standard server bench scored 2/10 because the default no-reasoning path generated a put method that does time.time() + (ttl if ttl is not None else self.default_ttl), which TypeErrors when both arguments are None. Re-running the same prompt with reasoning_effort: high through chat-template-kwargs produced 18,606 characters of reasoning across 28 minutes and a different implementation that scores 7/10. The new failures are all the same conceptual bug in a different place. The eviction path runs del self.expiry[oldest_key] unconditionally, which KeyErrors when the evicted key was stored without an expiry entry. Two attempts, two different bugs, both centered on None handling in the expiry map. The 7/10 with reasoning is the recorded score because that mode is the model's intended high-effort coding profile per the Unsloth docs.

Polyglot scored 30/65 on a single run, which puts the model 3rd on this benchmark behind Gemma-4-31B (44) and the Qwen3.6 pair (35 and 30). T7 Go and T4 CSV both perfect (10/10 and 7/7). T2 cron, T5 rate limiter, T6 Sinatra all 0/10, the same three challenges that have stayed unsolved across nearly every tested model. Polyglot variance on this bench is well-documented (one earlier model produced 6, 17, 19, 26, 35 across five identical runs), so 30/65 carries a single-run asterisk pending re-bench. Postgres came in at 39/57. T3 diagnosis was perfect 6/6, T2 query optimization 9/10, T1 SQL writing 5/10, T4 procedural 19/31. The pattern is consistent. The model fixes broken queries and identifies operational problems with surgical accuracy and writes original complex SQL less reliably. Cassandra 34/56 with T2 anti-pattern detection at 9/10, the strongest non-Gemma result on that tier, plus T1 query writing 8/10. T4 procedural collapsed to 12/30 with broken CQL DDL on time-bucketed schema, materialized view, batch denormalization, and SAI index challenges.

ROCm is the right backend here. pp512 runs 78 on ROCm, 62 on RADV, 17 on AMDVLK. Token generation is essentially identical across all three (2.92, 2.90, 2.99) because at 125B dense the workload is pure memory bandwidth. Build 9029 for Vulkan, 8721 for ROCm.

Mistral-Small-4-119B

Mistral's first open mixture of experts, released March 16. 119B total parameters, 6B active, 128 experts with 4 routed per token, Apache 2.0. The design folds the old Magistral reasoning, Pixtral vision, and Devstral coding models into one set of weights. Architecture is mistral4, new to llama.cpp but already recognized by build 9518. The Unsloth UD-Q4_K_XL split is three files at 70 GB and fits well inside the 120 GB soft cap. I tested the text path through llama-server on RADV.

Model	Arch	Size	pp512 (t/s)	tg128 (t/s)	Writing /30	Polyglot /65	Postgres /57	Cassandra /56	Combined /285
Mistral-Small-4-119B	MoE (6B active, 128 experts)	69 GiB	363	40.3	21	24	27	26	174

The reasoning interface matches Mistral-Medium-3.5. The template reads a reasoning_effort kwarg with two settings, none and high, and the bench's default enable_thinking: false is ignored. I set reasoning_effort: high at the server for the coding and database suites, and that choice exposed a sharp edge. The first LRU run scored 0/10 with an empty response. At high effort the model wrote 18,000 characters of reasoning into the separate reasoning_content field and hit the 4096-token cap before a single line of code reached the answer. Raising the limit to 20480 tokens fixed it and LRU recovered to 10/10. Anyone benching this model at high effort needs the larger budget or the code never lands.

Pure coding is the strong suit. 10/10 LRU, a clean 59/59 LeetCode, and 24/65 polyglot, which clears the Gemma 26B MoE and sits in the upper middle of the field. The Go rule engine missed by one line, an unused loop variable that Go treats as a compile error and that zeroed all ten tests. Postgres reached 27/57 with Tier 2 optimization at 9/10 and Tier 3 diagnosis at 5/6, while Tier 1 lost points to careless slips like a numeric typed as numic and an interval compared against an integer. Cassandra came in at 26/56 after I made the JSON extractor lenient. The model identified every Tier 2 anti-pattern correctly but pretty-printed its CQL fixes as multi-line strings with raw newlines, which strict JSON parsing rejected. Switching to strict=False moved Tier 2 from 5 to 7.

Creative writing is the weak spot at 21/30. The story hits all five required beats and the sensory work is good, with sand clinging like ghosts and the gritty cold of the dawn beach. The prose leans hard on telling, naming emotions in italics rather than showing them, and Jack stays flat next to Maya's sharper voice. Two things drag the score below the competent low twenties. The model breaks character at the end with a full assistant sign-off that offers to revise the story, and it prints a fabricated 2,000-word count when the actual text runs 2,364. It also reworks the brief, making Jack's late wife the very patient Maya lost, a coincidence the two-strangers setup never asked for.

RADV is the backend here. It leads token generation at 40 tg and trails ROCm only on prompt processing (363 against 441 pp), and tg is what matters for writing. At 6B active the model decodes more than ten times faster than the dense Mistral-Medium-3.5 at 3 tg, which is the whole point of the MoE. Build 9518 for Vulkan, 9172 for ROCm.

Granite 4.1 Family

IBM released Granite 4.1 on April 29 in three plain dense sizes (3.4B, 8.8B, 28.9B), all Apache 2.0 with 131K native context. Architecture is unchanged from Granite 4.0. Decoder-only dense transformer with GQA, RoPE, MLP plus SwiGLU, RMSNorm, shared input and output embeddings. The 4.1 jump is post-training only. Improved supervised finetuning and reinforcement learning targeting tool calling, instruction following, and chat. No special sampling profile from IBM or Unsloth, so I ran the bench suite at llama.cpp defaults (temp 0.7, top_p 0.9, repeat_penalty 1.1) with the chat template enabled through --jinja. All three pulled from Unsloth UD-Q4_K_XL.

Model	Arch	Size	pp512 (t/s)	tg128 (t/s)	Writing /30	Polyglot /65	Postgres /57	Cassandra /56	Combined /285
Granite-4.1-30B	Dense (28.9B)	16.5 GiB	275	11.8	28	10	32	31	172
Granite-4.1-8B	Dense (8.8B)	5.1 GiB	936	38.6	18	6	26	24	141
Granite-4.1-3B	Dense (3.4B)	2.0 GiB	2278	88.7	17	4	11	13	98

Granite-30B's writing score of 28/30 ties MiniMax-M2.7 for the highest non-Qwen result on the writing bench. Combined 172/285 ties Qwen3-30B-Instruct-2507 exactly. The writing benchmark is a 2000-word creative-writing prompt about two strangers, an ER nurse and a marine biologist, meeting at dawn on a North Carolina beach. Granite-30B hit all five required beats. Maya's wry voice ("Glad I could be your personal sand excavator") and Jack's deflective grief ("Close enough", thumb on the scar at his jaw) stay distinct throughout. Two named griefs land hard. The patient she held through that January night, the late wife implicit in his deflection. Sensory detail layers across mildew and salt, the "gray ribbon" beach, molten gold sunrise. One mild meta closer at the end ("The story ends here, but the question lingers...") cost a quality point. The model also EOS's at 1244 words against the 2000-word target. Both look like training tics rather than capability ceilings, since the same closer pattern shows up at all three sizes.

Granite-4.1-8B is the first model on this benchmark to score a perfect 22/22 on the hallucination calibration test. All ten factual answers correct, all eight false-premise traps caught, all four unanswerable questions answered with appropriate uncertainty, in 12.9 seconds. Qwen3.5-35B-A3B took 76.7 seconds for 21/22. The 8B also scores 59/59 LeetCode at 38.6 tg. Combined 141 sits within margin of Qwen3.5-9B at 146. Granite-4.1-3B (Combined 98) scores 53/59 LeetCode at 88.7 tg, the fastest tg of any sub-100 Combined model, but landed 0/10 on LRU and 0/8 on FastAPI. Both are genuine generation failures. The LRU code stores TTL as a single float in ttl_thresholds[key] and then unpacks it as (value, ttl) in _evict_expired, throwing TypeError on every test. The FastAPI code uses Body({"title": ...}), which is invalid Python syntax, and Body was never imported. Server crashed on startup.

Granite-30B trades throughput for quality. At 11.76 tg the dense model runs roughly five times slower than the Qwen3-30B MoE that ties it on Combined. Postgres T2 query optimization scored a perfect 10/10. FastAPI scored 2/8 across all three sizes, with HTTP 422 on POST and PUT. The Pydantic body schema is the failure mode and sampling does not fix it. The IBM RL push on tool calling did not transfer to FastAPI body schema generation. ROCm wins prompt processing on every size (3b 2323 vs 2278, 8b 1211 vs 936, 30b 303 vs 275). RADV wins token generation everywhere. Pick RADV for writing, ROCm for long-prompt server workloads on the 30b.

Nemotron-3-Nano-Omni-30B-A3B-Reasoning

NVIDIA's multimodal hybrid Mamba2-Transformer MoE, released April 28. 31B total parameters, 3B active, 256K native context. The training adds video, audio, image, and text input on top of the older text-only Nemotron-3-Nano-30B-A3B. Server build b8967 already supports the architecture (nemotron_h_moe). I tested only the text path through llama-server. Multimodal capability needs llama-mtmd-cli and is outside this benchmark suite.

Model	Arch	Size	pp512 (t/s)	tg128 (t/s)	Writing /30	Polyglot /65	Postgres /57	Cassandra /56	Combined /285
Nemotron-3-Nano-Omni-30B-A3B-Reasoning	Hybrid MoE (3B active)	22.8 GiB	1097	61.1	24	6	31	18	148

Combined 148/285 puts the Omni below the field's working models. Writing landed at 24/30, identical to the older text-only Nemotron-3-Nano-30B-A3B. Multimodal training did not move prose capability. RADV runs 1097 pp / 61.1 tg, beating AMDVLK (777/58.7) and ROCm (879/56.8 on the older b8721 build).

The model needs separate sampling profiles for separate tasks, per the Unsloth docs. Code generation wants Instruct-mode params (temp 0.2, top_k 1). LRU cache jumped from 0/10 to 8/10 with that switch. Default Thinking-mode params (temp 0.6, top_p 0.95) had produced cleanly extracted but buggy code that stored None in the expiry map and then crashed on None <= now. The same Instruct switch dropped polyglot from 6/65 to 2/65 because multi-challenge breadth wants the variance. Running a mixed workload through one llama-server instance means picking one profile and accepting losses on the other side.

FastAPI scored 2/8. The model writes the POST endpoint with a query-string title parameter instead of a Pydantic JSON body, so every POST returns HTTP 422. Sampling does not fix design failures. Cassandra T4 procedural collapsed to 2/30 with syntactically broken CQL DDL across LWT locks, materialized views, counter rate limiters, and batch denormalization. Postgres T2 optimization held at 9/10 and T3 diagnosis at 4/6. Query work is fine, procedural code is not.

Qwen3.6-27B

The dense companion to Qwen3.6-35B-A3B, released the same week. Same hybrid attention pattern, three Gated DeltaNet linear-attention layers per gated attention block, 262K native context, vision encoder built in. No expert routing. Every token activates the full 27B parameter set. Sampling params follow the family convention (temp 0.6 for code, 0.7 for creative, top_p 0.95, top_k 20, presence_penalty 1.5 for creative, thinking disabled through chat-template-kwargs).

Model	Arch	Size	pp512 (t/s)	tg128 (t/s)	Writing /30	Polyglot /65	Postgres /57	Cassandra /56	Combined /285
Qwen3.6-27B	Dense (27B)	16.4 GiB	322	12.0	30	30	44	32	213

Qwen3.6-27B lands 5th place on the Combined leaderboard at 213/285, behind Gemma-4-31B at 238, Ornith-1.0-35B at 227, the dense Gemma-4-12B at 222, and Gemma-4-26B-A4B at 215. At this size class MoE usually wins, but the 27B dense beats its 35B-A3B MoE sibling at 205. The dense version takes around five times longer per token (12 tg vs 60 tg) in exchange for one writing point and a much stronger Tier 4 procedural Postgres score.

Writing scores 30/30, the joint highest result I've recorded on this benchmark, tying Qwen3-30B-Instruct-2507 as champion. Em dash count is 2 across roughly 2150 words. The same prose hygiene issue from the 35B-A3B sibling shows up here. Even with enable_thinking:false set through chat-template-kwargs, the model leaks a planning outline above the output and an end-thinking tag below it. The generation itself (everything after the planning preamble) is clean and that's what got scored.

Polyglot averages 30/65 across three runs (24, 31, 34). The variance traces almost entirely to T7 Go, which scored 0/10, 10/10, 10/10. T1 regex parsing was perfect 10/10 every pass, T4 CSV aggregation perfect 7/7 every pass. The T2 cron matcher that the 35B-A3B sibling cracked on 4 of 5 runs stayed at 0/10 here. Cassandra averages 32/56 across three runs (30, 32, 33). T1 climbed each pass (6, 7, 8), T3 saturated at 6/6 every run, T4 stuck at exactly 10/30 every run with the same failure signature on LWT distributed lock, materialized view, and batch denormalization. Postgres came in at 44/57 on a single run. T2 and T3 were perfect (10/10 and 6/6), T4 procedural hit 23/31, the second-strongest result on disk after Gemma-4-31B's 22/31.

Qwen 3.6 Family

Alibaba's 3.6 generation brings a hybrid attention architecture. Each of the 40 layers runs three Gated DeltaNet (linear attention) blocks followed by one standard gated attention block. The MoE expert pool expands to 256 with 8 routed plus 1 shared active, and a vision encoder is built in. Sampling params match 3.5 (temp 0.6 for code, 0.7 for creative, top_p 0.95, top_k 20, presence_penalty 1.5, thinking via chat-template-kwargs).

Model	Arch	Size	pp512 (t/s)	tg128 (t/s)	Writing /30	Polyglot /65	Postgres /57	Cassandra /56
Qwen3.6-35B-A3B	Hybrid MoE (3B active, 256 experts)	20.8 GiB	1029	60	29	35	31	33

At the same quant and size as Qwen3.5-35B-A3B, the 3.6 sibling moves Combined from 192 to 205, now 6th overall behind Gemma-4-31B (238), Ornith-1.0-35B (227), Gemma-4-12B (222), Gemma-4-26B-A4B (215), and the dense Qwen3.6-27B (213). Writing improves one point to 29/30. LRU, LeetCode, FastAPI all max out the same. Polyglot jumps from 17 to 35 on a best-of-5 basis, which needs context. I ran five identical back-to-back polyglot passes and got 19, 26, 6, 17, and 35 out of 65. Same sampling params, same prompts, same model weights. On the 35 run the model cleanly solves cron matching (10/10) and Go rule engines (10/10), two challenges that were previously unsolved across the Qwen 3.5 lineup. On the 6 run it produces almost nothing usable. I suspect the hybrid DeltaNet layers are amplifying sampler noise, though I have not isolated it further.

Both database benchmarks regress a few points relative to the 3.5 generation. Postgres drops 32 to 31, Cassandra drops 38 to 33. The regressions come entirely from Tier 4 procedural challenges where the model generates syntactically broken PL/pgSQL and CQL DDL. T1 through T3 hold up. Postgres T2 and T3 are perfect (10/10 and 6/6). Cassandra T1 through T3 land at 8/10, 9/10, 6/6. The model understands the queries and the diagnosis, but assembling a valid CREATE TRIGGER or CREATE MATERIALIZED VIEW is where it breaks.

The creative writing output has a prose hygiene issue worth flagging. With enable_thinking:false set through chat-template-kwargs, the model produces a clean 2260-word story that I scored 29/30 (all five beats hit, 5/5 sensory immersion), and then dumps its own internal planning notes after the final sentence. Self-correction remarks, word-count checks, structural bullet points. The story itself is clean. The raw generation is not directly usable without post-processing. I think this pattern is embedded in training rather than fixable through sampling params alone.

Qwen 3.5 Family

Alibaba's MoE and dense lineup, tested with Unsloth-recommended sampling params (temp 0.6, top_p 0.95, thinking disabled).

Model	Arch	Size	pp512 (t/s)	tg128 (t/s)	Writing /30	Polyglot /65	Postgres /57	Cassandra /56
Qwen3.5-4B	Dense	4 GiB	1375	38	16	3	17	16
Qwen3.5-9B	Dense	5.6 GiB	972	36	20	14	28	23
Qwen3.5-27B	Dense	16 GiB	310	12	25	21	34	29
Qwen3.5-35B-A3B	MoE (3B active)	21 GiB	1017	60	28	17	32	38
Qwen3.5-122B-A10B	MoE (10B active)	72 GiB	129	21	29	13	36	37

The 35B-A3B MoE is the standout for speed and creative writing. At 60 t/s it runs 5x faster than the dense 27B while scoring higher on writing (28/30 vs 25/30). All models require Unsloth's recommended sampling params or they produce garbage. The 122B-A10B leads the family on PostgreSQL (36/57), Polyglot (13/65), and writing (29/30), but at 21 t/s it's 3x slower than the 35B MoE. On Cassandra the 122B improved to 37/56 via the server API, close to the 35B's 38/56. The 9B is the only model in either family to solve PostgreSQL recursive cycle detection. RLS defeats everyone.

MiniMax-M2.7

M2.7 is the successor to M2.5, same 229B MoE architecture with 256 experts and 8 active at roughly 10B active parameters. I first tried the UD-Q3_K_XL quant at 95 GiB, then found UD-IQ4_XS at 101 GiB also fits within the 120 GiB VRAM cap. Better quant quality for only 6 GiB more. Writing improved from M2.5's 26/30 to 28/30, and all coding benchmarks max out (10/10 LRU, 59/59 LeetCode, 8/8 FastAPI). Combined score 179/285.

M2.5 required a custom Jinja template to suppress thinking because --reasoning-budget 0 had no effect. M2.7 is the exact opposite. --reasoning-budget 0 works perfectly, producing clean concise output. But the custom no-think template (which removes the <think> tag from the generation prompt) causes the model to reason inline in plain text. It writes meta-commentary about its own thought process, burns through the entire token budget, and never produces any code. I lost an entire overnight bench run to this before figuring it out. Without the fix, LRU Cache scored 0/10 (600s timeout) and polyglot was 0/65. With --reasoning-budget 0, LRU went to 10/10 and polyglot recovered to 8/65.

PostgreSQL scored 38/57 with T2 optimization at 9/10 and T3 diagnosis at 6/6. Cassandra revealed a tradeoff. T1-T3 work better with no-think (faster, no timeouts), but T4 procedural challenges need thinking to self-correct CQL syntax mistakes. With thinking, T4 scores 7/30 vs 2/30 without. The model understands TWCS, LWT locks, counters, and SAI indexes but gets clause ordering wrong without time to self-correct. RADV is the right backend at 27 t/s, unlike M2.5 which preferred AMDVLK.

MiniMax-M2.5

I tested this model back in February with an IQ3_XXS quant and llama.cpp defaults. It scored 0/10 on coding (missing imports), 24/30 on writing, and I shelved it. Unsloth's Dynamic 2.0 UD-Q3_K_XL quant and their recommended sampling params (temp 1.0, top_p 0.95, min_p 0.01, top_k 40) changed everything. Coding went to 10/10, LeetCode to 59/59, writing to 26/30. The previous 0/10 was a quant quality and sampling configuration problem, not a model capability problem.

At 229B total and roughly 10B active parameters, it eats 94 GiB. The largest model on disk by a wide margin. AMDVLK is the right backend here, giving 32 t/s generation vs 22 on RADV. That 32 t/s translates to about 8 words/second on writing tasks, which is usable but not fast. PostgreSQL landed at 40/57 with perfect 10/10 on T2 query optimization and 19/31 on T4 procedural. Cassandra scored 30/56, also with perfect 10/10 on T2 anti-pattern detection. Only the second model to ace Cassandra T2 after Gemma-4-26B-A4B. Combined 185/285 puts it 7th overall.

The model has a quirk that cost me several hours to track down. Its chat template hardcodes a <think> token in the generation prompt with no toggle to disable it. The --reasoning-budget 0 flag has no effect. I built a custom Jinja template that strips the thinking prefix, but it turns out the model is equally verbose without it. For the polyglot gauntlet I had to bump context from 32K to 64K because the thinking block was eating half the token budget and truncating code output. That got the polyglot score from 6/65 to 13/65, mostly by rescuing the bash log analyzer (0/8 to 6/8). The Ruby and Go challenges still truncate even at 64K.

PostgreSQL Benchmark

I added a database benchmark that tests real PostgreSQL skills against a live Postgres 18 instance. A fresh container spins up per run with 2.7 million rows of seeded e-commerce data. 28 challenges across four tiers, 57 test points total. Tier 1 tests complex SQL writing (window functions, recursive CTEs, LATERAL JOINs, JSONB aggregation). Tier 2 tests query optimization where the model receives a slow query and its EXPLAIN plan, then has to produce a fix that actually reduces cost against the live data. Tier 3 tests DBA diagnosis from simulated pg_stat_activity and EXPLAIN output. Tier 4 is the hard one. Models write PL/pgSQL functions, triggers, exclusion constraints, RLS policies, and batch procedures with savepoint error handling. Their code gets executed and validated through behavioral test cases.

Re-running the field through the server API firmed up the standings. Gemma-4-31B leads at 51/57, the 26B-A4B MoE follows at 49, and the new dense 12B lands 47, the three best Postgres scores on the board. Qwen3.6-27B is the strongest non-Gemma at 44. Query optimization in Tier 2 is the great equalizer. Even the 2B model scores 8/10, correctly reaching for GIN, partial, and covering indexes. Tier 4 procedural is where the field separates. The 12B clears 26/31, acing exclusion constraints, audit triggers, partition management, recursive cycle detection, and batch savepoints, and falls only on row-level security. RLS multi-tenant policies defeat the whole Gemma family at 0/5, and only Qwen3.6-27B scrapes a partial 2/5. The E-series small models collapse to 7-10/31 on the same tier.

Cassandra Benchmark

I built a Cassandra benchmark to test whether models that handle PostgreSQL can also reason about distributed databases. A 3-node Cassandra 5.0 cluster spins up on Strix per run, seeded with 1.5 million rows of IoT sensor data across 8 tables. 32 challenges across four tiers, 56 test points total. The benchmark checks whether the model understands Cassandra's data model instead of treating CQL as SQL syntax. Every partition key must be in the WHERE clause. There are no JOINs. ORDER BY only works on clustering columns. Deletes create tombstones. Consistency is per-operation. A model trained on relational patterns will fail even if it knows CQL keywords.

Tier 2 is the most interesting tier and the hardest departure from PostgreSQL. Cassandra has no EXPLAIN, so there is no query optimizer to reason about. Instead, the model receives a broken schema with symptoms and has to identify the anti-pattern. Unbounded partition growth, hot partitions, secondary index scatter-gather, tombstone accumulation from queue patterns, ALLOW FILTERING abuse, compaction strategy mismatches, consistency level math, batch misuse, materialized view pitfalls, and collection size limits. Gemma-4-E2B scored 7/10 here. A 2B model explaining why a logged multi-partition batch is slower than individual async inserts, or why a secondary index on email causes fan-out to all 12 nodes. These concepts need distributed systems understanding, not coding ability.

Tier 4 executes model code against the live cluster. Time-bucketed schema design with TWCS compaction, distributed locks via lightweight transactions, materialized views, counter-based rate limiters, batch denormalization, and Storage-Attached Indexes. The 31B, the 26B-A4B, and the new 12B all land 15/30, acing LWT locks and counters but failing materialized views, batch denormalization, and SAI indexes on CQL DDL syntax. The E2B actually outperforms the E4B here (10 vs 4), perfectly solving both the LWT lock and counter challenges while the 4B model can't produce valid CQL for any of them.

T1 CQL writing scores range from 2/10 (Qwen3.5-4B) to 9/10 (Gemma-4-31B and E4B), with the new 12B right behind at 8/10. The E4B result stands out, a 7.5B dense model matching the 31B on CQL syntax while most MoE models with more total parameters score 7-8/10. TOKEN() range scans, GROUP BY semantics, and PER PARTITION LIMIT with multi-column clustering keys are the remaining unsolved patterns. The dense 31B and 26B MoE tie at 39/56, the new 12B sits one back at 38, and Qwen3.6-35B-A3B leads the Qwen side at 33.

Polyglot Coding Gauntlet

The LRU cache benchmark ran its course. 15 models score a perfect 10/10, making it useless for differentiation. I built a new polyglot coding gauntlet with seven challenges across Python, Bash, Ruby/Sinatra, and Go, totalling 65 auto-graded tests at LeetCode medium/hard difficulty. The challenges test real-world skills like parsing structured logs with nested JSON, matching cron expressions, building awk pipelines, implementing FastAPI rate-limiting middleware, writing HMAC-verified Sinatra webhook handlers, and building recursive Go rule engines.

Running the field through the server API instead of llama-cli reset the board. Gemma-4-31B leads the local models at 44/65, nearly tripling its old llama-cli result of 15 on cleaner output extraction alone. Qwen3.6-35B-A3B reaches 35 on a best-of-5 basis, the new dense 12B hits 32 on a single run, and Qwen3.6-27B averages 30. The cron matcher, unsolved across the first fourteen models, finally falls. Qwen3.6-35B-A3B solves it cleanly at 10/10 and the 12B takes a partial 6/10. The Go rule engine flipped from a curiosity to a regular solve, with the 31B, the 35B-A3B, and the 12B all reaching 10/10. Two challenges still resist every local model. The FastAPI rate limiter and the Sinatra webhook processor stay at 0/10 across the board, both wanting more framework scaffolding than a single shot produces.

Two clear capability profiles emerged. The Qwen3 family (30B, Coder-Next, 27B dense) excels at Python regex, scoring 9-10/10 on the structured log parser, but completely fails bash extraction. GPT-OSS, Gemma, and the Qwen3.5 MoE family handle bash pipelines well but are weak on regex. Dense models outperform their MoE siblings on structured output. Gemma-4-31B (dense, 44/65) beats the 26B-A4B MoE (22/65) despite being 5x slower. Thinking mode actively hurts. Qwen3.5-9B jumped from 1 to 5 points when thinking was disabled. Model size doesn't correlate with score. The 30B outperforms the 122B, and Nemotron-Cascade-2's IMO gold medal reasoning scored just 1/65.

I also tested whether quantization precision matters by running Gemma-4-26B-A4B at both Q4_K_XL and Q8_0. It doesn't. Run-to-run variance exceeds the quant quality difference when you only have 3.8B active parameters. The benchmark measures model capability ceilings, not quant artifacts.

Hardware

CPU	AMD Ryzen AI MAX+ 395 (16C/32T)
GPU	Radeon 8060S Graphics (RDNA 3.5, gfx1151)
Memory	128GB unified (120GB soft VRAM cap)
Kernel	6.17.0-29-generic
Vulkan	1.4.321 (Mesa RADV 25.2.3 + AMDVLK 2025.Q2.1)
ROCm	TheRock 7.13.0a20260408 nightly
llama.cpp	b9518 (Vulkan), b9172 (ROCm)

Three backends, one GPU

RADV

Mesa's open-source Vulkan driver. Best overall since b8119 MMQ fix with a 25% pp boost on Qwen3-30B. Recommended default.

AMDVLK

AMD's open-source Vulkan driver. Best for GPT-OSS-120B prompt processing. Good stability for production use.

ROCm

AMD's compute stack via TheRock nightly. Requires -mmp 0. Best for some dense models (GLM-4.7, Qwen3.5-27B).

Explore results