Local LLM benchmarks on AMD Strix Halo

Multi-backend benchmark results for gfx1151, comparing RADV, AMDVLK, and ROCm across performance, creative writing, and coding tasks.

Ryzen AI MAX+ 395 • Radeon 8060S • 128GB UMA • RDNA 3.5
b9518
llama.cpp build
3
Backends tested
32+
Models tested
10
Benchmark suites

Tooling benchmark New

I added a tenth suite that answers a question the other nine do not. Most of my actual local model usage is unattended automation, scripts that summarize articles into strict JSON, sort transactions into fixed categories, and pull dates and amounts out of messy text. Those pipelines call a model dozens of times per run, parse the output with code, never retry, and silently drop anything that fails to parse. For that kind of work, peak quality matters less than whether call 47 of 65 comes back machine-readable.

The tooling benchmark runs 19 challenges as 65 separate calls across four tiers. Strict JSON output against real pipeline prompts, extraction and classification with exact ground truth, plain-text format contracts, and a robustness tier covering prompt injection, deliberately altered facts, and refusals on harmless input. Every challenge runs multiple trials at temperature 0.7, and the headline numbers are reliability rates and per-call latency rather than a single best answer.

Three models have run it so far. Qwen3.6-27B leads at 60.5/65 with every one of its 65 calls machine-usable and 96.9 percent parsing with zero recovery. It swept the hostile JSON tier, including an article full of Windows paths designed to break escape handling, and was the only model to pass the email format contract all three times. Gemma-4-12B scored 54/65 at 93.8 percent usable, with a perfect extraction tier but three straight failures on that same email contract and one unrecoverable JSON response on the backslash article. Qwen3.6-35B-A3B scored 53.5/65, also with 100 percent of calls usable, at five seconds per call, three times faster than the 27B. Its points went to chronically short summaries against a stated word minimum.

The robustness tier produced the most interesting result. All three models ignored the embedded injection attempts in article text, except the 35B-A3B on one fixture. Its summary stayed on topic and contained none of the injected text, but the importance score it assigned came back as exactly the value a fake system message inside the article had demanded. A page setting its own ranking while being summarized is the exact failure mode that tier exists to catch. Both Qwen models also added the word diesel to a ferry story whose source never mentions fuel at all, importing the fact from world knowledge, while Gemma did not. Full details on the methodology page, scores in a new column on the models page.

Gemma-4-12B New

Gemma-4-12B is the newest result on the board. Released 2026-06-03 and run at Unsloth UD-Q8_K_XL, it lands Combined 222/285, second overall behind Gemma-4-31B at 238. It uses 13.6 GiB and posts the strongest score per parameter I have measured on this hardware.

Model Arch Size pp512 (t/s) tg128 (t/s) Writing /30 Polyglot /65 Postgres /57 Cassandra /56
Gemma-4-E2B Dense (4.6B) 2.9 GiB 3382 109 18 10 27 27
Gemma-4-E4B Dense (7.5B) 4.7 GiB 1828 59 23 13 35 34
Gemma-4-26B-A4B MoE (4B active) 16 GiB 1196 52.9 23 22 49 39
Gemma-4-12B New Dense (12B) 13.6 GiB 716 14.0 28 32 47 38
Gemma-4-31B Dense 17.5 GiB 261 11.1 27 44 51 39

The 12B result changes the Gemma lineup. It writes 28/30, one point above the 31B, with all five required beats, distinct character voices, and a clean opening. Postgres is 47/57, third best on the board, with perfect diagnosis and 26/31 on procedural tasks. Cassandra is 38/56, tied with the 26B-A4B and one behind the 31B. Polyglot is 32/65 with a perfect Go rule engine and a partial cron solve at 6/10.

Speed is the only weak point. At Q8 a dense 12B is memory-bandwidth-bound here, so generation stays around 14 to 15 tokens per second across all three backends. ROCm leads prompt processing at 928 against RADV's 716, but generation barely moves. A Q4 build would roughly double throughput at some quality cost, so Q8 stays the quality reference until a lower-precision run proves itself.

The rest of the family still matters. The E2B is the fastest model I have tested at 109 tokens/second and under 3 GiB. The 26B-A4B MoE leads Gemma on Cassandra tie-breaks and posts 49/57 on PostgreSQL. The dense 31B still owns the top combined score at 238/285, but the new 12B is close enough to be the practical default for quality testing.

Mainline MTP for Qwen3.6

am17an's Multi-Token Prediction PR (#22673) merged into mainline llama.cpp on May 16, build 9191. Mainline now ships MTP support for the Qwen3.5 and Qwen3.6 families. The Gemma-4 MTP work (PR #22738) was closed without merging and is not coming back, so Qwen is the only family this matters for today. I pulled the new build, downloaded the Unsloth MTP-enabled GGUFs for both the 27B dense and the 35B-A3B MoE, and ran a baseline pass plus draft-mtp at n=2 and n=3 across the same nine-prompt suite the PR author used for the published numbers.

MTP loads the draft head from the same single GGUF file. There is no separate drafter to manage. Activate it with --spec-type draft-mtp --spec-draft-n-max N at server start. The model must be an MTP variant. The regular Unsloth GGUF loads fine but the spec flag silently does nothing because the nextn tensors are not present.

Model Config tok/s Speedup Draft accept
Qwen3.6-27B (dense) baseline 11.58 1.00x n/a
Qwen3.6-27B (dense) MTP n=2 20.37 1.76x 79.7%
Qwen3.6-27B (dense) MTP n=3 21.32 1.84x 72.3%
Qwen3.6-35B-A3B (MoE) baseline 54.85 1.00x n/a
Qwen3.6-35B-A3B (MoE) MTP n=2 65.85 1.20x 77.7%
Qwen3.6-35B-A3B (MoE) MTP n=3 66.95 1.22x 71.4%

The dense 27B is the story here. 1.84x token generation at n=3 with zero quality cost, because MTP is lossless draft-and-verify rejection sampling at the target distribution. The server-side baseline of 11.58 tok/s matches the 12.03 llama-bench reference within chat-template overhead. The win on the dense model comes from being memory-bandwidth-bound during decode. The MTP head's extra forward pass is cheap relative to the wait on weight bandwidth, and most drafts get accepted, so the saved verifies are nearly free wall time.

The MoE 35B-A3B gains only 1.22x. With 3B active parameters decode is already compute-light, MTP overhead eats most of the would-be savings, and the absolute baseline of 54.85 tok/s leaves less room to grow. The acceptance rate is still healthy at 71 to 78 percent. The mechanism works fine, the headroom is just smaller. If memory is tight the regular Unsloth GGUF (without the extra 0.42B MTP layer) is the cleaner choice. If the model is loaded anyway, leave MTP on.

Acceptance rate tracks prompt determinism, not model. Both models accept above 80 percent on summarize, code_python, and translation. Both drop below 55 percent on creative_short, which is a four-line poem about a lighthouse. High-entropy generation kills draft acceptance because every plausible continuation breaks the draft. The 27B holds 1.60x speedup even at 54 percent acceptance because the dense model has so much bandwidth slack to absorb the failed drafts. The MoE collapses to 1.05x on the same prompt. Per-prompt best at n=3 on the 27B is summarize at 25.58 tok/s (2.13x), on the 35B-A3B is summarize at 87.79 tok/s (1.48x).

n=3 beats n=2 in aggregate for both models. The wider draft window catches more tokens per accepted cycle, which more than compensates for the lower per-draft acceptance rate. Prompt processing was not measurably degraded at the standard -b 2048 server config across this nine-prompt suite. The PR warns of a D2H embedding-transfer penalty on PP, but the prompts are short enough that PP is not the bottleneck. ROCm with MTP is the open question. The ROCm build was last refreshed before the MTP merge, so a rebuild and rerun is the next step there.

Setup is llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf -ngl 999 -b 2048 -t 32 -fa 1 -c 8192 --spec-type draft-mtp --spec-draft-n-max 3. RADV throughout, mainline llama.cpp at 4f13cb742 (build 9191). For the 27B this becomes my default writing config going forward. For the 35B-A3B it stays on by default since the model is loaded anyway, but the dense 27B with MTP is now the throughput-quality sweet spot on Strix.

Real-world workloads on Strix Halo

The 1.84x spec-bench number is the peak case across short prompts averaging 192 tokens. I ran two benchmarks where the gap actually shows up for users (long-form writing and the polyglot coding suite) to check how MTP holds up under realistic Strix Halo workloads with proper Qwen3.6 sampling.

Workload Wall tok/s Quality
Creative writing (2K-word story) 186s 16.4 (1.37x baseline) ~28/30, all five required beats
Polyglot (7 challenges, mixed languages) 254s total varies per challenge 43/65 (new high, +8 vs prior)

Creative writing is where MTP gives back the least, and that is exactly the workload most Strix Halo users care about. High-entropy prose has the lowest draft acceptance rate because every plausible next token breaks a draft, and the presence_penalty=1.5 that the Qwen3.6 family needs to avoid prose loops further depresses acceptance since the MTP head trained on the unmodified distribution. The 1.37x speedup falls well below the 1.84x peak, but on a 2294-word generation that is the difference between 186 seconds and 254 seconds. Quality held, with all five required beats landed in order and no looping.

Polyglot is where MTP gets out of its own way. Code has predictable continuations like function signatures, common idioms, and well-known boilerplate, so per-challenge draft acceptance lands above 70 percent. The bench scored 43/65 in 4.2 minutes wall time, eight points above the previous Qwen3.6-27B best. T2 Cron solved 10/10, which is unusual since T2 plus T5 Rate Limiter plus T6 Sinatra have been the three universal fails across nearly every model on this bench. The Cron win likely came from the corrected repeat_penalty=1.0 (Qwen3.6 Unsloth default) instead of the prior 1.1 (Qwen3 family default), not from MTP itself. Polyglot has documented single-run variance so 43 carries a one-run asterisk.

For practical Strix Halo use, dense Qwen3.6-27B is the model that benefits most from MTP because gfx1151 decode is memory-bandwidth-bound and the MTP head's extra forward pass is cheap relative to the bandwidth wait. The MoE 35B-A3B sees a smaller 1.22x gain because 3B-active decode is already compute-light and leaves no slack for MTP to absorb. For users whose primary workload is long-form writing or daily-driver coding, the configuration is Qwen3.6-27B + MTP n=3 + the Unsloth Qwen3.5-family sampling profile (presence_penalty=1.5 for writing, repeat_penalty=1.0 across the board). Expect 16 to 21 tok/s depending on workload, with no quality regression.

Mistral-Medium-3.5-128B

Mistral's 128B dense flagship, released March 31. Two upstream blockers cleared in late April. Unsloth uploaded the full GGUF quant set (UD-Q3_K_XL through Q8_0) and the YaRN parser bug that mangled long-context behavior was fixed by setting mscale_all_dim from 1 to 0 in the model config. Architecture is mistral3, the same family as the Mistral-Small-3.1 and 3.2 24B models llama.cpp already supports. Native 256K context, multimodal text and image input through the vision encoder, though I tested only the text path through llama-server. The Unsloth UD-Q4_K_XL split is 75.7 GB across three files and fits comfortably inside the 120 GB soft cap on Strix.

The reasoning interface differs from every other model on this bench. Mistral 3.5 reads a reasoning_effort chat-template kwarg with two settings, "none" and "high". The default "none" produces a direct response. The default enable_thinking: false kwarg the bench passes for Qwen and Gemma is silently ignored here. That detail matters for the LRU result. I ran the standard tracked battery once with default sampling (temp 0.6, top_p 0.95, repeat_penalty 1.0, presence_penalty 0.0) and re-ran the LRU prompt with reasoning_effort: high to get the proper coding score.

Model Arch Size pp512 (t/s) tg128 (t/s) Writing /30 Polyglot /65 Postgres /57 Cassandra /56 Combined /285
Mistral-Medium-3.5-128B Dense (125B) 70.5 GiB 78 3.0 30 30 39 34 207

Combined 207/285 lands the model 5th overall on the leaderboard, behind Gemma-4-31B (238), Gemma-4-12B (222), Gemma-4-26B-A4B (215), and Qwen3.6-27B (213), with Qwen3.6-35B-A3B (205) just below. Writing scores 30/30, joint top with Qwen3-30B-Instruct-2507 and Qwen3.6-27B. The story output runs 2082 words against the 2000-word target, reaches all five required beats, and carries a structural payoff that earns the score on craft rather than padding. The hatchling rescue at the heart of the scene parallels the patient grief introduced in the opening, and the model lands the parallel through dialogue (a flat "wasn't meant to" delivered over a struggling baby turtle) rather than narration. Em dash count stays inside budget, sensory layering covers salt, sand texture, light, and skin warmth without any of the AI-prose tells the bench scorecard treats as deductions. Generation cost was 1090.8 seconds (1.9 words per second on ROCm).

The LRU coding score requires explanation. The standard server bench scored 2/10 because the default no-reasoning path generated a put method that does time.time() + (ttl if ttl is not None else self.default_ttl), which TypeErrors when both arguments are None. Re-running the same prompt with reasoning_effort: high through chat-template-kwargs produced 18,606 characters of reasoning across 28 minutes and a different implementation that scores 7/10. The new failures are all the same conceptual bug in a different place. The eviction path runs del self.expiry[oldest_key] unconditionally, which KeyErrors when the evicted key was stored without an expiry entry. Two attempts, two different bugs, both centered on None handling in the expiry map. The 7/10 with reasoning is the recorded score because that mode is the model's intended high-effort coding profile per the Unsloth docs.

Polyglot scored 30/65 on a single run, which puts the model 3rd on this benchmark behind Gemma-4-31B (44) and the Qwen3.6 pair (35 and 30). T7 Go and T4 CSV both perfect (10/10 and 7/7). T2 cron, T5 rate limiter, T6 Sinatra all 0/10, the same three challenges that have stayed unsolved across nearly every tested model. Polyglot variance on this bench is well-documented (one earlier model produced 6, 17, 19, 26, 35 across five identical runs), so 30/65 carries a single-run asterisk pending re-bench. Postgres came in at 39/57. T3 diagnosis was perfect 6/6, T2 query optimization 9/10, T1 SQL writing 5/10, T4 procedural 19/31. The pattern is consistent. The model fixes broken queries and identifies operational problems with surgical accuracy and writes original complex SQL less reliably. Cassandra 34/56 with T2 anti-pattern detection at 9/10, the strongest non-Gemma result on that tier, plus T1 query writing 8/10. T4 procedural collapsed to 12/30 with broken CQL DDL on time-bucketed schema, materialized view, batch denormalization, and SAI index challenges.

ROCm is the right backend here. pp512 runs 78 on ROCm, 62 on RADV, 17 on AMDVLK. Token generation is essentially identical across all three (2.92, 2.90, 2.99) because at 125B dense the workload is pure memory bandwidth. Build 9029 for Vulkan, 8721 for ROCm.

Mistral-Small-4-119B

Mistral's first open mixture of experts, released March 16. 119B total parameters, 6B active, 128 experts with 4 routed per token, Apache 2.0. The design folds the old Magistral reasoning, Pixtral vision, and Devstral coding models into one set of weights. Architecture is mistral4, new to llama.cpp but already recognized by build 9518. The Unsloth UD-Q4_K_XL split is three files at 70 GB and fits well inside the 120 GB soft cap. I tested the text path through llama-server on RADV.

Model Arch Size pp512 (t/s) tg128 (t/s) Writing /30 Polyglot /65 Postgres /57 Cassandra /56 Combined /285
Mistral-Small-4-119B MoE (6B active, 128 experts) 69 GiB 363 40.3 21 24 27 26 174

The reasoning interface matches Mistral-Medium-3.5. The template reads a reasoning_effort kwarg with two settings, none and high, and the bench's default enable_thinking: false is ignored. I set reasoning_effort: high at the server for the coding and database suites, and that choice exposed a sharp edge. The first LRU run scored 0/10 with an empty response. At high effort the model wrote 18,000 characters of reasoning into the separate reasoning_content field and hit the 4096-token cap before a single line of code reached the answer. Raising the limit to 20480 tokens fixed it and LRU recovered to 10/10. Anyone benching this model at high effort needs the larger budget or the code never lands.

Pure coding is the strong suit. 10/10 LRU, a clean 59/59 LeetCode, and 24/65 polyglot, which clears the Gemma 26B MoE and sits in the upper middle of the field. The Go rule engine missed by one line, an unused loop variable that Go treats as a compile error and that zeroed all ten tests. Postgres reached 27/57 with Tier 2 optimization at 9/10 and Tier 3 diagnosis at 5/6, while Tier 1 lost points to careless slips like a numeric typed as numic and an interval compared against an integer. Cassandra came in at 26/56 after I made the JSON extractor lenient. The model identified every Tier 2 anti-pattern correctly but pretty-printed its CQL fixes as multi-line strings with raw newlines, which strict JSON parsing rejected. Switching to strict=False moved Tier 2 from 5 to 7.

Creative writing is the weak spot at 21/30. The story hits all five required beats and the sensory work is good, with sand clinging like ghosts and the gritty cold of the dawn beach. The prose leans hard on telling, naming emotions in italics rather than showing them, and Jack stays flat next to Maya's sharper voice. Two things drag the score below the competent low twenties. The model breaks character at the end with a full assistant sign-off that offers to revise the story, and it prints a fabricated 2,000-word count when the actual text runs 2,364. It also reworks the brief, making Jack's late wife the very patient Maya lost, a coincidence the two-strangers setup never asked for.

RADV is the backend here. It leads token generation at 40 tg and trails ROCm only on prompt processing (363 against 441 pp), and tg is what matters for writing. At 6B active the model decodes more than ten times faster than the dense Mistral-Medium-3.5 at 3 tg, which is the whole point of the MoE. Build 9518 for Vulkan, 9172 for ROCm.

Granite 4.1 Family

IBM released Granite 4.1 on April 29 in three plain dense sizes (3.4B, 8.8B, 28.9B), all Apache 2.0 with 131K native context. Architecture is unchanged from Granite 4.0. Decoder-only dense transformer with GQA, RoPE, MLP plus SwiGLU, RMSNorm, shared input and output embeddings. The 4.1 jump is post-training only. Improved supervised finetuning and reinforcement learning targeting tool calling, instruction following, and chat. No special sampling profile from IBM or Unsloth, so I ran the bench suite at llama.cpp defaults (temp 0.7, top_p 0.9, repeat_penalty 1.1) with the chat template enabled through --jinja. All three pulled from Unsloth UD-Q4_K_XL.

Model Arch Size pp512 (t/s) tg128 (t/s) Writing /30 Polyglot /65 Postgres /57 Cassandra /56 Combined /285
Granite-4.1-30B Dense (28.9B) 16.5 GiB 275 11.8 28 10 32 31 172
Granite-4.1-8B Dense (8.8B) 5.1 GiB 936 38.6 18 6 26 24 141
Granite-4.1-3B Dense (3.4B) 2.0 GiB 2278 88.7 17 4 11 13 98

Granite-30B's writing score of 28/30 ties MiniMax-M2.7 for the highest non-Qwen result on the writing bench. Combined 172/285 ties Qwen3-30B-Instruct-2507 exactly. The writing benchmark is a 2000-word creative-writing prompt about two strangers, an ER nurse and a marine biologist, meeting at dawn on a North Carolina beach. Granite-30B hit all five required beats. Maya's wry voice ("Glad I could be your personal sand excavator") and Jack's deflective grief ("Close enough", thumb on the scar at his jaw) stay distinct throughout. Two named griefs land hard. The patient she held through that January night, the late wife implicit in his deflection. Sensory detail layers across mildew and salt, the "gray ribbon" beach, molten gold sunrise. One mild meta closer at the end ("The story ends here, but the question lingers...") cost a quality point. The model also EOS's at 1244 words against the 2000-word target. Both look like training tics rather than capability ceilings, since the same closer pattern shows up at all three sizes.

Granite-4.1-8B is the first model on this benchmark to score a perfect 22/22 on the hallucination calibration test. All ten factual answers correct, all eight false-premise traps caught, all four unanswerable questions answered with appropriate uncertainty, in 12.9 seconds. Qwen3.5-35B-A3B took 76.7 seconds for 21/22. The 8B also scores 59/59 LeetCode at 38.6 tg. Combined 141 sits within margin of Qwen3.5-9B at 146. Granite-4.1-3B (Combined 98) scores 53/59 LeetCode at 88.7 tg, the fastest tg of any sub-100 Combined model, but landed 0/10 on LRU and 0/8 on FastAPI. Both are genuine generation failures. The LRU code stores TTL as a single float in ttl_thresholds[key] and then unpacks it as (value, ttl) in _evict_expired, throwing TypeError on every test. The FastAPI code uses Body({"title": ...}), which is invalid Python syntax, and Body was never imported. Server crashed on startup.

Granite-30B trades throughput for quality. At 11.76 tg the dense model runs roughly five times slower than the Qwen3-30B MoE that ties it on Combined. Postgres T2 query optimization scored a perfect 10/10. FastAPI scored 2/8 across all three sizes, with HTTP 422 on POST and PUT. The Pydantic body schema is the failure mode and sampling does not fix it. The IBM RL push on tool calling did not transfer to FastAPI body schema generation. ROCm wins prompt processing on every size (3b 2323 vs 2278, 8b 1211 vs 936, 30b 303 vs 275). RADV wins token generation everywhere. Pick RADV for writing, ROCm for long-prompt server workloads on the 30b.

Nemotron-3-Nano-Omni-30B-A3B-Reasoning

NVIDIA's multimodal hybrid Mamba2-Transformer MoE, released April 28. 31B total parameters, 3B active, 256K native context. The training adds video, audio, image, and text input on top of the older text-only Nemotron-3-Nano-30B-A3B. Server build b8967 already supports the architecture (nemotron_h_moe). I tested only the text path through llama-server. Multimodal capability needs llama-mtmd-cli and is outside this benchmark suite.

Model Arch Size pp512 (t/s) tg128 (t/s) Writing /30 Polyglot /65 Postgres /57 Cassandra /56 Combined /285
Nemotron-3-Nano-Omni-30B-A3B-Reasoning Hybrid MoE (3B active) 22.8 GiB 1097 61.1 24 6 31 18 148

Combined 148/285 puts the Omni below the field's working models. Writing landed at 24/30, identical to the older text-only Nemotron-3-Nano-30B-A3B. Multimodal training did not move prose capability. RADV runs 1097 pp / 61.1 tg, beating AMDVLK (777/58.7) and ROCm (879/56.8 on the older b8721 build).

The model needs separate sampling profiles for separate tasks, per the Unsloth docs. Code generation wants Instruct-mode params (temp 0.2, top_k 1). LRU cache jumped from 0/10 to 8/10 with that switch. Default Thinking-mode params (temp 0.6, top_p 0.95) had produced cleanly extracted but buggy code that stored None in the expiry map and then crashed on None <= now. The same Instruct switch dropped polyglot from 6/65 to 2/65 because multi-challenge breadth wants the variance. Running a mixed workload through one llama-server instance means picking one profile and accepting losses on the other side.

FastAPI scored 2/8. The model writes the POST endpoint with a query-string title parameter instead of a Pydantic JSON body, so every POST returns HTTP 422. Sampling does not fix design failures. Cassandra T4 procedural collapsed to 2/30 with syntactically broken CQL DDL across LWT locks, materialized views, counter rate limiters, and batch denormalization. Postgres T2 optimization held at 9/10 and T3 diagnosis at 4/6. Query work is fine, procedural code is not.

Qwen3.6-27B

The dense companion to Qwen3.6-35B-A3B, released the same week. Same hybrid attention pattern, three Gated DeltaNet linear-attention layers per gated attention block, 262K native context, vision encoder built in. No expert routing. Every token activates the full 27B parameter set. Sampling params follow the family convention (temp 0.6 for code, 0.7 for creative, top_p 0.95, top_k 20, presence_penalty 1.5 for creative, thinking disabled through chat-template-kwargs).

Model Arch Size pp512 (t/s) tg128 (t/s) Writing /30 Polyglot /65 Postgres /57 Cassandra /56 Combined /285
Qwen3.6-27B Dense (27B) 16.4 GiB 322 12.0 30 30 44 32 213

Qwen3.6-27B lands 4th place on the Combined leaderboard at 213/285, behind Gemma-4-31B at 238, the new dense Gemma-4-12B at 222, and Gemma-4-26B-A4B at 215. At this size class MoE usually wins, but the 27B dense beats its 35B-A3B MoE sibling at 205. The dense version takes around five times longer per token (12 tg vs 60 tg) in exchange for one writing point and a much stronger Tier 4 procedural Postgres score.

Writing scores 30/30, the joint highest result I've recorded on this benchmark, tying Qwen3-30B-Instruct-2507 as champion. Em dash count is 2 across roughly 2150 words. The same prose hygiene issue from the 35B-A3B sibling shows up here. Even with enable_thinking:false set through chat-template-kwargs, the model leaks a planning outline above the output and an end-thinking tag below it. The generation itself (everything after the planning preamble) is clean and that's what got scored.

Polyglot averages 30/65 across three runs (24, 31, 34). The variance traces almost entirely to T7 Go, which scored 0/10, 10/10, 10/10. T1 regex parsing was perfect 10/10 every pass, T4 CSV aggregation perfect 7/7 every pass. The T2 cron matcher that the 35B-A3B sibling cracked on 4 of 5 runs stayed at 0/10 here. Cassandra averages 32/56 across three runs (30, 32, 33). T1 climbed each pass (6, 7, 8), T3 saturated at 6/6 every run, T4 stuck at exactly 10/30 every run with the same failure signature on LWT distributed lock, materialized view, and batch denormalization. Postgres came in at 44/57 on a single run. T2 and T3 were perfect (10/10 and 6/6), T4 procedural hit 23/31, the second-strongest result on disk after Gemma-4-31B's 22/31.

Qwen 3.6 Family

Alibaba's 3.6 generation brings a hybrid attention architecture. Each of the 40 layers runs three Gated DeltaNet (linear attention) blocks followed by one standard gated attention block. The MoE expert pool expands to 256 with 8 routed plus 1 shared active, and a vision encoder is built in. Sampling params match 3.5 (temp 0.6 for code, 0.7 for creative, top_p 0.95, top_k 20, presence_penalty 1.5, thinking via chat-template-kwargs).

Model Arch Size pp512 (t/s) tg128 (t/s) Writing /30 Polyglot /65 Postgres /57 Cassandra /56
Qwen3.6-35B-A3B Hybrid MoE (3B active, 256 experts) 20.8 GiB 1029 60 29 35 31 33

At the same quant and size as Qwen3.5-35B-A3B, the 3.6 sibling moves Combined from 192 to 205, now 5th overall behind Gemma-4-31B (238), Gemma-4-12B (222), Gemma-4-26B-A4B (215), and the dense Qwen3.6-27B (213). Writing improves one point to 29/30. LRU, LeetCode, FastAPI all max out the same. Polyglot jumps from 17 to 35 on a best-of-5 basis, which needs context. I ran five identical back-to-back polyglot passes and got 19, 26, 6, 17, and 35 out of 65. Same sampling params, same prompts, same model weights. On the 35 run the model cleanly solves cron matching (10/10) and Go rule engines (10/10), two challenges that were previously unsolved across the Qwen 3.5 lineup. On the 6 run it produces almost nothing usable. I suspect the hybrid DeltaNet layers are amplifying sampler noise, though I have not isolated it further.

Both database benchmarks regress a few points relative to the 3.5 generation. Postgres drops 32 to 31, Cassandra drops 38 to 33. The regressions come entirely from Tier 4 procedural challenges where the model generates syntactically broken PL/pgSQL and CQL DDL. T1 through T3 hold up. Postgres T2 and T3 are perfect (10/10 and 6/6). Cassandra T1 through T3 land at 8/10, 9/10, 6/6. The model understands the queries and the diagnosis, but assembling a valid CREATE TRIGGER or CREATE MATERIALIZED VIEW is where it breaks.

The creative writing output has a prose hygiene issue worth flagging. With enable_thinking:false set through chat-template-kwargs, the model produces a clean 2260-word story that I scored 29/30 (all five beats hit, 5/5 sensory immersion), and then dumps its own internal planning notes after the final sentence. Self-correction remarks, word-count checks, structural bullet points. The story itself is clean. The raw generation is not directly usable without post-processing. I think this pattern is embedded in training rather than fixable through sampling params alone.

Qwen 3.5 Family

Alibaba's MoE and dense lineup, tested with Unsloth-recommended sampling params (temp 0.6, top_p 0.95, thinking disabled).

Model Arch Size pp512 (t/s) tg128 (t/s) Writing /30 Polyglot /65 Postgres /57 Cassandra /56
Qwen3.5-4B Dense 4 GiB 1375 38 16 3 17 16
Qwen3.5-9B Dense 5.6 GiB 972 36 20 14 28 23
Qwen3.5-27B Dense 16 GiB 310 12 25 21 34 29
Qwen3.5-35B-A3B MoE (3B active) 21 GiB 1017 60 28 17 32 38
Qwen3.5-122B-A10B MoE (10B active) 72 GiB 129 21 29 13 36 37

The 35B-A3B MoE is the standout for speed and creative writing. At 60 t/s it runs 5x faster than the dense 27B while scoring higher on writing (28/30 vs 25/30). All models require Unsloth's recommended sampling params or they produce garbage. The 122B-A10B leads the family on PostgreSQL (36/57), Polyglot (13/65), and writing (29/30), but at 21 t/s it's 3x slower than the 35B MoE. On Cassandra the 122B improved to 37/56 via the server API, close to the 35B's 38/56. The 9B is the only model in either family to solve PostgreSQL recursive cycle detection. RLS defeats everyone.

MiniMax-M2.7

M2.7 is the successor to M2.5, same 229B MoE architecture with 256 experts and 8 active at roughly 10B active parameters. I first tried the UD-Q3_K_XL quant at 95 GiB, then found UD-IQ4_XS at 101 GiB also fits within the 120 GiB VRAM cap. Better quant quality for only 6 GiB more. Writing improved from M2.5's 26/30 to 28/30, and all coding benchmarks max out (10/10 LRU, 59/59 LeetCode, 8/8 FastAPI). Combined score 179/285.

M2.5 required a custom Jinja template to suppress thinking because --reasoning-budget 0 had no effect. M2.7 is the exact opposite. --reasoning-budget 0 works perfectly, producing clean concise output. But the custom no-think template (which removes the <think> tag from the generation prompt) causes the model to reason inline in plain text. It writes meta-commentary about its own thought process, burns through the entire token budget, and never produces any code. I lost an entire overnight bench run to this before figuring it out. Without the fix, LRU Cache scored 0/10 (600s timeout) and polyglot was 0/65. With --reasoning-budget 0, LRU went to 10/10 and polyglot recovered to 8/65.

PostgreSQL scored 38/57 with T2 optimization at 9/10 and T3 diagnosis at 6/6. Cassandra revealed a tradeoff. T1-T3 work better with no-think (faster, no timeouts), but T4 procedural challenges need thinking to self-correct CQL syntax mistakes. With thinking, T4 scores 7/30 vs 2/30 without. The model understands TWCS, LWT locks, counters, and SAI indexes but gets clause ordering wrong without time to self-correct. RADV is the right backend at 27 t/s, unlike M2.5 which preferred AMDVLK.

MiniMax-M2.5

I tested this model back in February with an IQ3_XXS quant and llama.cpp defaults. It scored 0/10 on coding (missing imports), 24/30 on writing, and I shelved it. Unsloth's Dynamic 2.0 UD-Q3_K_XL quant and their recommended sampling params (temp 1.0, top_p 0.95, min_p 0.01, top_k 40) changed everything. Coding went to 10/10, LeetCode to 59/59, writing to 26/30. The previous 0/10 was a quant quality and sampling configuration problem, not a model capability problem.

At 229B total and roughly 10B active parameters, it eats 94 GiB. The largest model on disk by a wide margin. AMDVLK is the right backend here, giving 32 t/s generation vs 22 on RADV. That 32 t/s translates to about 8 words/second on writing tasks, which is usable but not fast. PostgreSQL landed at 40/57 with perfect 10/10 on T2 query optimization and 19/31 on T4 procedural. Cassandra scored 30/56, also with perfect 10/10 on T2 anti-pattern detection. Only the second model to ace Cassandra T2 after Gemma-4-26B-A4B. Combined 185/285 puts it 7th overall.

The model has a quirk that cost me several hours to track down. Its chat template hardcodes a <think> token in the generation prompt with no toggle to disable it. The --reasoning-budget 0 flag has no effect. I built a custom Jinja template that strips the thinking prefix, but it turns out the model is equally verbose without it. For the polyglot gauntlet I had to bump context from 32K to 64K because the thinking block was eating half the token budget and truncating code output. That got the polyglot score from 6/65 to 13/65, mostly by rescuing the bash log analyzer (0/8 to 6/8). The Ruby and Go challenges still truncate even at 64K.

PostgreSQL Benchmark

I added a database benchmark that tests real PostgreSQL skills against a live Postgres 18 instance. A fresh container spins up per run with 2.7 million rows of seeded e-commerce data. 28 challenges across four tiers, 57 test points total. Tier 1 tests complex SQL writing (window functions, recursive CTEs, LATERAL JOINs, JSONB aggregation). Tier 2 tests query optimization where the model receives a slow query and its EXPLAIN plan, then has to produce a fix that actually reduces cost against the live data. Tier 3 tests DBA diagnosis from simulated pg_stat_activity and EXPLAIN output. Tier 4 is the hard one. Models write PL/pgSQL functions, triggers, exclusion constraints, RLS policies, and batch procedures with savepoint error handling. Their code gets executed and validated through behavioral test cases.

Re-running the field through the server API firmed up the standings. Gemma-4-31B leads at 51/57, the 26B-A4B MoE follows at 49, and the new dense 12B lands 47, the three best Postgres scores on the board. Qwen3.6-27B is the strongest non-Gemma at 44. Query optimization in Tier 2 is the great equalizer. Even the 2B model scores 8/10, correctly reaching for GIN, partial, and covering indexes. Tier 4 procedural is where the field separates. The 12B clears 26/31, acing exclusion constraints, audit triggers, partition management, recursive cycle detection, and batch savepoints, and falls only on row-level security. RLS multi-tenant policies defeat the whole Gemma family at 0/5, and only Qwen3.6-27B scrapes a partial 2/5. The E-series small models collapse to 7-10/31 on the same tier.

Cassandra Benchmark

I built a Cassandra benchmark to test whether models that handle PostgreSQL can also reason about distributed databases. A 3-node Cassandra 5.0 cluster spins up on Strix per run, seeded with 1.5 million rows of IoT sensor data across 8 tables. 32 challenges across four tiers, 56 test points total. The benchmark checks whether the model understands Cassandra's data model instead of treating CQL as SQL syntax. Every partition key must be in the WHERE clause. There are no JOINs. ORDER BY only works on clustering columns. Deletes create tombstones. Consistency is per-operation. A model trained on relational patterns will fail even if it knows CQL keywords.

Tier 2 is the most interesting tier and the hardest departure from PostgreSQL. Cassandra has no EXPLAIN, so there is no query optimizer to reason about. Instead, the model receives a broken schema with symptoms and has to identify the anti-pattern. Unbounded partition growth, hot partitions, secondary index scatter-gather, tombstone accumulation from queue patterns, ALLOW FILTERING abuse, compaction strategy mismatches, consistency level math, batch misuse, materialized view pitfalls, and collection size limits. Gemma-4-E2B scored 7/10 here. A 2B model explaining why a logged multi-partition batch is slower than individual async inserts, or why a secondary index on email causes fan-out to all 12 nodes. These concepts need distributed systems understanding, not coding ability.

Tier 4 executes model code against the live cluster. Time-bucketed schema design with TWCS compaction, distributed locks via lightweight transactions, materialized views, counter-based rate limiters, batch denormalization, and Storage-Attached Indexes. The 31B, the 26B-A4B, and the new 12B all land 15/30, acing LWT locks and counters but failing materialized views, batch denormalization, and SAI indexes on CQL DDL syntax. The E2B actually outperforms the E4B here (10 vs 4), perfectly solving both the LWT lock and counter challenges while the 4B model can't produce valid CQL for any of them.

T1 CQL writing scores range from 2/10 (Qwen3.5-4B) to 9/10 (Gemma-4-31B and E4B), with the new 12B right behind at 8/10. The E4B result stands out, a 7.5B dense model matching the 31B on CQL syntax while most MoE models with more total parameters score 7-8/10. TOKEN() range scans, GROUP BY semantics, and PER PARTITION LIMIT with multi-column clustering keys are the remaining unsolved patterns. The dense 31B and 26B MoE tie at 39/56, the new 12B sits one back at 38, and Qwen3.6-35B-A3B leads the Qwen side at 33.

Polyglot Coding Gauntlet

The LRU cache benchmark ran its course. 15 models score a perfect 10/10, making it useless for differentiation. I built a new polyglot coding gauntlet with seven challenges across Python, Bash, Ruby/Sinatra, and Go, totalling 65 auto-graded tests at LeetCode medium/hard difficulty. The challenges test real-world skills like parsing structured logs with nested JSON, matching cron expressions, building awk pipelines, implementing FastAPI rate-limiting middleware, writing HMAC-verified Sinatra webhook handlers, and building recursive Go rule engines.

Running the field through the server API instead of llama-cli reset the board. Gemma-4-31B leads the local models at 44/65, nearly tripling its old llama-cli result of 15 on cleaner output extraction alone. Qwen3.6-35B-A3B reaches 35 on a best-of-5 basis, the new dense 12B hits 32 on a single run, and Qwen3.6-27B averages 30. The cron matcher, unsolved across the first fourteen models, finally falls. Qwen3.6-35B-A3B solves it cleanly at 10/10 and the 12B takes a partial 6/10. The Go rule engine flipped from a curiosity to a regular solve, with the 31B, the 35B-A3B, and the 12B all reaching 10/10. Two challenges still resist every local model. The FastAPI rate limiter and the Sinatra webhook processor stay at 0/10 across the board, both wanting more framework scaffolding than a single shot produces.

Two clear capability profiles emerged. The Qwen3 family (30B, Coder-Next, 27B dense) excels at Python regex, scoring 9-10/10 on the structured log parser, but completely fails bash extraction. GPT-OSS, Gemma, and the Qwen3.5 MoE family handle bash pipelines well but are weak on regex. Dense models outperform their MoE siblings on structured output. Gemma-4-31B (dense, 44/65) beats the 26B-A4B MoE (22/65) despite being 5x slower. Thinking mode actively hurts. Qwen3.5-9B jumped from 1 to 5 points when thinking was disabled. Model size doesn't correlate with score. The 30B outperforms the 122B, and Nemotron-Cascade-2's IMO gold medal reasoning scored just 1/65.

I also tested whether quantization precision matters by running Gemma-4-26B-A4B at both Q4_K_XL and Q8_0. It doesn't. Run-to-run variance exceeds the quant quality difference when you only have 3.8B active parameters. The benchmark measures model capability ceilings, not quant artifacts.

Hardware

CPUAMD Ryzen AI MAX+ 395 (16C/32T)
GPURadeon 8060S Graphics (RDNA 3.5, gfx1151)
Memory128GB unified (120GB soft VRAM cap)
Kernel6.17.0-29-generic
Vulkan1.4.321 (Mesa RADV 25.2.3 + AMDVLK 2025.Q2.1)
ROCmTheRock 7.13.0a20260408 nightly
llama.cppb9518 (Vulkan), b9172 (ROCm)

Three backends, one GPU

RADV

Mesa's open-source Vulkan driver. Best overall since b8119 MMQ fix with a 25% pp boost on Qwen3-30B. Recommended default.

AMDVLK

AMD's open-source Vulkan driver. Best for GPT-OSS-120B prompt processing. Good stability for production use.

ROCm

AMD's compute stack via TheRock nightly. Requires -mmp 0. Best for some dense models (GLM-4.7, Qwen3.5-27B).

Explore results

28+
Models compared
AesSedai
vs Unsloth quants
113
Database test cases