Performance
| Model | Quant | Size | RADV pp (t/s) | RADV tg (t/s) | AMDVLK pp (t/s) | AMDVLK tg (t/s) |
|---|---|---|---|---|---|---|
| Gemma-4-E2B | UD-Q4_K_XL | 2.9 GiB | 3382 | 109 | 954 | 99 |
| Granite-4.1-3B | UD-Q4_K_XL | 2.0 GiB | 2278 | 88.7 | 652 | 88.1 |
| Gemma-4-E4B | UD-Q4_K_XL | 4.7 GiB | 1828 | 59 | 491 | 57 |
| GPT-OSS-20B-Derestricted | MXFP4 | 11 GiB | 1405 | 77 | 1380 | 77 |
| Qwen3.5-4B | Q8_0 | 4 GiB | 1375 | 37.9 | 510 | 40.0 |
| Gemma-4-26B-A4B | Q8_0 | 25 GiB | 1303 | 47.7 | 790 | 47.1 |
| Qwen3-30B-Instruct-2507 | UD-Q4_K_XL | 16.5 GiB | 1143 | 92 | 936 | 93 |
| Nemotron-3-Nano-30B-A3B | UD-Q4_K_XL | 21.3 GiB | 1106 | 68 | 776 | 65 |
| Nemotron-3-Nano-Omni-30B-A3B-Reasoning | UD-Q4_K_XL | 22.8 GiB | 1097 | 61.1 | 777 | 58.7 |
| Qwen3.6-35B-A3B | Unsloth UD-Q4_K_XL | 20.8 GiB | 1029 | 59.6 | 615 | 57.6 |
| Qwen3.5-35B-A3B | Unsloth UD-Q4_K_XL | 21 GiB | 1017 | 59.8 | 686 | 60.4 |
| GLM-4.7-Flash | UD-Q4_K_XL | 16.3 GiB | 990 | 73 | 529 | 70 |
| Qwen3.5-9B | UD-Q4_K_XL | 5.6 GiB | 972 | 35.7 | 289 | 35.2 |
| Nemotron-Cascade-2-30B-A3B | Q8_0 | 31 GiB | 968 | 54 | n/a | n/a |
| Granite-4.1-8B | UD-Q4_K_XL | 5.1 GiB | 936 | 38.6 | 265 | 38.7 |
| Kimi-Linear-48B-A3B | Q8_0 | 48.6 GiB | 746 | 52 | 570 | 53 |
| Gemma-4-12B | Unsloth UD-Q8_K_XL | 13.6 GiB | 716 | 14.0 | 199 | 14.0 |
| Ministral-3-14B | UD-Q4_K_XL | 7.8 GiB | 696 | 25 | 173 | 25 |
| GPT-OSS-120B | MXFP4 | 59 GiB | 596 | 56 | 661 | 53 |
| Qwen3-Coder-Next-80B | MXFP4 | 41 GiB | 586 | 40 | 462 | 43 |
| Magistral-Small-2509 | UD-Q4_K_XL | 13.5 GiB | 389 | 15 | 94 | 15 |
| Devstral-Small-2-24B | UD-Q4_K_XL | 13.5 GiB | 382 | 15 | 94 | 15 |
| Mistral-Small-4-119B | Unsloth UD-Q4_K_XL | 69 GiB | 363 | 40.3 | 313 | 39.2 |
| Qwen3.6-27B | Unsloth UD-Q4_K_XL | 16.4 GiB | 322 | 12.0 | 86 | 12.0 |
| Qwen3.5-27B | UD-Q4_K_XL | 16 GiB | 310 | 12.1 | 86 | 11.9 |
| Qwen3.5-122B-A10B | Unsloth UD-Q4_K_XL | 72 GiB | 287 | 22.4 | 197 | 21.9 |
| Granite-4.1-30B | UD-Q4_K_XL | 16.5 GiB | 275 | 11.8 | 71.2 | 11.9 |
| Gemma-4-31B | Unsloth UD-Q4_K_XL (Apr 11) | 17.5 GiB | 261 | 11.1 | 70.8 | 11.1 |
| MiniMax-M2.7 | Unsloth UD-IQ4_XS | 101 GiB | 181 | 27.0 | 179 | 24.8 |
| MiniMax-M2.5 | Unsloth UD-Q3_K_XL | 94 GiB | 179 | 22 | 164 | 32 |
| Mistral-Medium-3.5-128B | Unsloth UD-Q4_K_XL | 70.5 GiB | 62 | 2.9 | 17 | 2.9 |
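The driver gap in the table is easier to compare as a percentage than as two raw columns. The following is a minimal sketch, not the benchmark's own tooling: the `PP_RATES` dictionary and the `radv_advantage_pct` helper are illustrative names, and the numbers are copied from four rows of the table above.

```python
# Prompt-processing (pp) throughput in tokens/s as (RADV, AMDVLK) pairs,
# copied from a few rows of the performance table. Names are illustrative.
PP_RATES = {
    "Gemma-4-E2B": (3382, 954),
    "GPT-OSS-20B-Derestricted": (1405, 1380),
    "GPT-OSS-120B": (596, 661),
    "Magistral-Small-2509": (389, 94),
}


def radv_advantage_pct(radv_pp: float, amdvlk_pp: float) -> float:
    """Percentage by which RADV pp exceeds AMDVLK pp (negative if AMDVLK is faster)."""
    return (radv_pp / amdvlk_pp - 1.0) * 100.0


if __name__ == "__main__":
    for model, (radv, amdvlk) in PP_RATES.items():
        print(f"{model}: {radv_advantage_pct(radv, amdvlk):+.1f}%")
```

A positive value means RADV is faster on prompt processing for that row. GPT-OSS-120B is the one row here that comes out negative.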
Quality
The Tooling column is new and tracks automation reliability out of 65. It is scored separately and does not count toward the Combined total. Models marked with a hyphen have not run it yet.
| Model | Writing /30 | LRU /10 | FastAPI /8 | LeetCode /59 | Polyglot /65 | Postgres /57 | Cassandra /56 | Tooling /65 | Combined /285 |
|---|---|---|---|---|---|---|---|---|---|
| Gemma-4-31B | 27 | 10 | 8 | 59 | 44 | 51 | 39 | - | 238 |
| Gemma-4-12B | 28 | 10 | 8 | 59 | 32 | 47 | 38 | 54 | 222 |
| Gemma-4-26B-A4B | 28 | 10 | 8 | 59 | 22 | 49 | 39 | - | 215 |
| Qwen3.6-27B | 30 | 10 | 8 | 59 | 30 | 44 | 32 | 60.5 | 213 |
| Mistral-Medium-3.5-128B | 30 | 7 | 8 | 59 | 30 | 39 | 34 | - | 207 |
| Qwen3.6-35B-A3B | 29 | 10 | 8 | 59 | 35 | 31 | 33 | 53.5 | 205 |
| Qwen3.5-122B-A10B | 29 | 10 | 8 | 59 | 13 | 36 | 37 | - | 192 |
| Qwen3.5-35B-A3B | 28 | 10 | 8 | 59 | 17 | 32 | 38 | - | 192 |
| Kimi-Linear-48B-A3B | 30 | 10 | 8 | 59 | 22 | 30 | 31 | - | 190 |
| Qwen3.5-27B | 25 | 10 | 8 | 59 | 21 | 34 | 29 | - | 186 |
| MiniMax-M2.5 | 26 | 10 | 7 | 59 | 13 | 40 | 30 | - | 185 |
| GPT-OSS-120B | 20 | 10 | 8 | 59 | 14 | 40 | 31 | - | 182 |
| MiniMax-M2.7 | 28 | 10 | 8 | 59 | 8 | 38 | 28 | - | 179 |
| Gemma-4-E4B | 23 | 7 | 8 | 59 | 13 | 35 | 34 | - | 179 |
| Mistral-Small-4-119B | 21 | 10 | 7 | 59 | 24 | 27 | 26 | - | 174 |
| Qwen3-30B-Instruct-2507 | 30 | 10 | 2 | 59 | 13 | 27 | 31 | - | 172 |
| Granite-4.1-30B | 28 | 10 | 2 | 59 | 10 | 32 | 31 | - | 172 |
| Qwen3-Coder-Next-80B | 26 | 10 | 2 | 59 | 9 | 33 | 32 | - | 171 |
| Devstral-Small-2-24B | 27 | 10 | 2 | 59 | 11 | 29 | 31 | - | 169 |
| GPT-OSS-20B-Derestricted | 13 | 10 | 8 | 59 | 14 | 37 | 23 | - | 164 |
| Gemma-4-E2B | 18 | 7 | 8 | 59 | 10 | 27 | 27 | - | 156 |
| Nemotron-3-Nano-Omni-30B-A3B-Reasoning | 24 | 8 | 2 | 59 | 6 | 31 | 18 | - | 148 |
| Ministral-3-14B | 26 | 2 | 2 | 59 | 16 | 23 | 18 | - | 146 |
| Qwen3.5-9B | 20 | 10 | 0 | 51 | 14 | 28 | 23 | - | 146 |
| Granite-4.1-8B | 18 | 4 | 4 | 59 | 6 | 26 | 24 | - | 141 |
| Nemotron-Cascade-2-30B-A3B | 18 | 10 | 8 | 59 | 1 | 22 | 21 | - | 139 |
| Qwen3.5-4B | 16 | 9 | 8 | 54 | 3 | 17 | 16 | - | 123 |
| Nemotron-3-Nano-30B-A3B | 20 | 4 | 0 | 46 | 7 | 16 | 16 | - | 109 |
| Magistral-Small-2509 | 20 | 0 | 8 | 30 | 2 | 12 | 35 | - | 107 |
| Granite-4.1-3B | 17 | 0 | 0 | 53 | 4 | 11 | 13 | - | 98 |
| GLM-4.7-Flash | 14 | 0 | 0 | 16 | 0 | 23 | 27 | - | 80 |
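The Combined column is the sum of the seven scored benchmarks, with Tooling left out. A small sketch of that arithmetic, assuming a per-model dictionary of scores (the `MAX_POINTS` and `combined` names are illustrative, not taken from the benchmark code):

```python
# Maximum points per benchmark that counts toward Combined.
# Tooling (/65) is deliberately absent because it is scored separately.
MAX_POINTS = {
    "Writing": 30,
    "LRU": 10,
    "FastAPI": 8,
    "LeetCode": 59,
    "Polyglot": 65,
    "Postgres": 57,
    "Cassandra": 56,
}


def combined(scores: dict) -> float:
    """Sum only the benchmarks listed in MAX_POINTS; any Tooling entry is ignored."""
    return sum(scores[name] for name in MAX_POINTS)


# Row for Qwen3.6-27B, copied from the table above.
qwen36_27b = {
    "Writing": 30, "LRU": 10, "FastAPI": 8, "LeetCode": 59,
    "Polyglot": 30, "Postgres": 44, "Cassandra": 32, "Tooling": 60.5,
}

if __name__ == "__main__":
    print(sum(MAX_POINTS.values()))  # denominator of the Combined column
    print(combined(qwen36_27b))      # Combined for this row, Tooling excluded
```

The maxima add up to 285, and the Qwen3.6-27B row adds up to its listed 213 whether or not a Tooling score is present.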
Key Findings
- RADV leads prompt processing on every model measured on both drivers except GPT-OSS-120B, where AMDVLK wins (661 vs 596 pp). The margin ranges from about 1% (MiniMax-M2.7, 181 vs 179) to over 300% (Magistral-Small-2509, 389 vs 94). Token generation is typically tied between drivers since it's bandwidth-bound.
- Kimi-Linear-48B is the new polyglot leader at 22/65, dethroning the GPT-OSS/Gemma trio (14/65). It's one of only two models to score on the Go rule engine (10/10) alongside Qwen3-30B (3/10). Linear attention gives it 72 t/s at 28 GiB.
- Devstral-Small-2 punches above its weight. A 24B dense coding model hitting 29/57 Postgres and 31/56 Cassandra while scoring 10/10 LRU and 59/59 LeetCode. The 15 t/s generation speed hurts but the quality-per-parameter is impressive.
- Nemotron-3-Nano collapses without thinking mode. 0/57 Postgres, 0/56 Cassandra, 0/59 LeetCode. The Mamba-2 hybrid architecture appears to need reasoning tokens enabled to produce structured output. Fast (1106 pp, 68 tg) but unusable for coding or database tasks with no-think.
- Magistral-Small has the highest Cassandra score (35/56) of any non-Gemma model, beating Devstral (31) and Kimi (24). But it scores 0/10 LRU, 12/57 Postgres, and leaked its reasoning scaffold into the creative-writing test output. A specialist, not a generalist.
- GLM-4.7-Flash can't code. 0/10 LRU, 0/8 FastAPI, 16/59 LeetCode (extraction failures), 0/65 polyglot. But it handles database work (23/57 PG, 27/56 Cass) and generates at 73 t/s. A fast model with a very narrow skill set.
- MiniMax-M2.5 went from 0/10 coding to 10/10 after switching to Unsloth's UD-Q3_K_XL quant and their recommended sampling params (temp 1.0, top_p 0.95, min_p 0.01, top_k 40). At 94 GiB it's the largest model on disk, but 185/285 Combined puts it 6th overall. Perfect 10/10 on both PostgreSQL T2 optimization and Cassandra T2 anti-pattern detection. AMDVLK is its best backend at 32 t/s (vs 22 on RADV).
- MiniMax-M2.7 is the 229B successor with a UD-IQ4_XS quant that squeezes into 101 GiB. Writing improved to 28/30 and all coding benchmarks max out (10/10 LRU, 59/59 LeetCode, 8/8 FastAPI). The model has an unusual no-think quirk. Its M2.5 predecessor required a custom Jinja template to suppress thinking. M2.7 reverses this entirely. The standard --reasoning-budget 0 flag works, but the custom template causes the model to reason inline in plain text without ever producing code. Combined 179/285 puts it just behind M2.5 at 185. PostgreSQL T2 optimization and T3 diagnosis are both perfect (9/10 and 6/6). Cassandra T4 procedural challenges benefit from leaving thinking on (7/30 vs 2/30 without). RADV is the right backend here, unlike M2.5 which preferred AMDVLK.
- Qwen3.6-35B-A3B is the hybrid-attention successor to the 3.5 MoE. Combined 205/285, writing 29/30, and polyglot 35/65 put it near the top, but the 35 polyglot score is best-of-5 rather than stable single-run behavior. Database regressions come from Tier 4 procedural SQL and CQL, while raw writing output appends internal planning notes even with enable_thinking:false.
- Qwen3.6-27B is the dense companion to the 35B-A3B MoE. Combined 213/285 and writing 30/30 make it stronger on quality, while 12 tg makes it roughly five times slower than the MoE sibling. Polyglot averages 30/65 with Go variance, Postgres reaches 44/57, and the raw writing output has the same planning-note leak.
- Granite-4.1-30B ties MiniMax-M2.7 for the highest non-Qwen writing score at 28/30 and lands Combined 172/285. It also hits 10/10 LRU, 59/59 LeetCode, and perfect Postgres T2, but FastAPI stays at 2/8 and generation is slow at 11.76 tg. The 8B is notable for a perfect 22/22 hallucination-calibration run, while the 3B remains a speed-first model with LRU and FastAPI failures.
- Mistral-Medium-3.5-128B lands Combined 207/285 and writes 30/30, but coding needs the model's high-effort reasoning interface. Default no-reasoning LRU scored 2/10, while reasoning_effort: high raised it to 7/10 with a different expiry-map bug. ROCm is the right backend for prompt processing, and token generation stays around 3 t/s across all backends because the 125B dense workload is memory-bandwidth-bound.
- Nemotron-3-Nano-Omni-30B-A3B-Reasoning adds multimodal training but does not improve the text-only writing score. Combined 148/285 puts it below the working field, and task-specific sampling is required. Instruct-mode params raise LRU from 0/10 to 8/10 but lower polyglot from 6/65 to 2/65, while FastAPI and Cassandra T4 still fail on endpoint design and CQL syntax.
- Sampling params transform quality results. Unsloth-recommended params (presence_penalty 1.5, top_k 20, thinking mode via chat-template-kwargs) made the difference between 0/10 and 10/10 on several models.
- T2 (cron), T5 (FastAPI), and T6 (Sinatra) remain unsolved at 0/10 across all model runs. These single-shot challenges exceed current local model capability.
- Gemma 4 leads both database benchmarks. Dense 31B tops PostgreSQL (46/57) and Cassandra (39/56). The 26B MoE follows closely. No other model family breaks 35/57 PG or 35/56 Cass.
- Mistral-Small-4-119B is the first Mistral mixture of experts on the board, 119B total with 6B active. On RADV it runs 363 pp and 40 tg, which is A6B-class generation speed for a 119B model and far quicker than the dense Mistral-Medium-3.5 at 3 tg. Combined 174/285 rests on strong pure coding (10/10 LRU, 59/59 LeetCode, 24/65 polyglot) against a mid-pack 21/30 on creative writing. Reasoning is binary, either none or high, and high is token-hungry. The first LRU run scored 0/10 because the reasoning trace filled the entire 4096-token budget before any code reached the response, so the suite now runs at --max-tokens 20480. The creative-writing output ends with an assistant sign-off offering to revise the story, which the scorecard treats as a deduction.
- The Cassandra JSON extractor now parses with strict=False, so anti-pattern answers that pretty-print multi-line CQL inside a JSON string value still count. Models often write a CREATE TABLE across several lines with literal newlines, which strict JSON rejects even when the answer is correct. The change raised Mistral-Small-4 Tier 2 from 5/10 to 7/10. Cassandra scores recorded before 2026-06-05 predate it and may understate any model that emitted the same multi-line JSON.
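The strict=False change described in the last finding can be illustrated with Python's standard `json` module, where `strict=False` lets control characters such as literal newlines appear inside string values. This is a minimal sketch and not the extractor itself: the `parse_answer` helper, the `fix` field, and the sample answer are hypothetical.

```python
import json

# Hypothetical model answer: the CQL is pretty-printed, so the JSON string
# value contains literal newline characters instead of escaped "\\n".
raw_answer = '{"fix": "CREATE TABLE events (\n  id uuid PRIMARY KEY,\n  ts timestamp\n)"}'


def parse_answer(text: str) -> dict:
    """Parse a model answer, tolerating control characters inside JSON strings."""
    return json.loads(text, strict=False)


if __name__ == "__main__":
    try:
        json.loads(raw_answer)  # default strict=True rejects the raw newlines
    except json.JSONDecodeError as err:
        print("strict parse failed:", err.msg)

    answer = parse_answer(raw_answer)
    print(answer["fix"].splitlines()[0])
```

The lenient parse only relaxes the rule about control characters inside strings. Malformed JSON, such as a missing brace or quote, still fails in either mode.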
Partial Results
Models evaluated before the full benchmark suite was established. These ran writing and LRU cache tests but not the complete battery. No longer on disk. Listed here for historical reference.
Performance
| Model | Quant | Size | RADV pp (t/s) | RADV tg (t/s) | AMDVLK pp (t/s) | AMDVLK tg (t/s) |
|---|---|---|---|---|---|---|
| Step3.5-Flash | IQ3_XS | 76 GiB | 237 | 32 | n/a | n/a |
| Nemotron-3-Super-120B-A12B | Unsloth UD-Q4_K_XL | 78 GiB | 196 | 10.2 | 139 | 9.86 |
Quality
| Model | Writing /30 | LRU /10 |
|---|---|---|
| Ling-Flash-2.0 | 26 | 2 |
| Nemotron-3-Super-120B-A12B | 25 | 10 |
| Devstral-2-123B | 25 | 2 |
| Solar-Open-100B | 21 | 0 |
| Mistral-Large-2411 | 20 | 2 |