Models | Strix Benchmarks

Performance

Model	Quant	Size	RADV pp	RADV tg	AMDVLK pp	AMDVLK tg
Gemma-4-E2B	UD-Q4_K_XL	2.9 GiB	3382	109	954	99
Granite-4.1-3B	UD-Q4_K_XL	2.0 GiB	2278	88.7	652	88.1
Gemma-4-E4B	UD-Q4_K_XL	4.7 GiB	1828	59	491	57
GPT-OSS-20B-Derestricted	MXFP4	11 GiB	1405	77	1380	77
Qwen3.5-4B	Q8_0	4 GiB	1375	37.9	510	40.0
Gemma-4-26B-A4B	Q8_0	25 GiB	1303	47.7	790	47.1
Qwen3-30B-Instruct-2507	UD-Q4_K_XL	16.5 GiB	1143	92	936	93
Nemotron-3-Nano-30B-A3B	UD-Q4_K_XL	21.3 GiB	1106	68	776	65
Nemotron-3-Nano-Omni-30B-A3B-Reasoning	UD-Q4_K_XL	22.8 GiB	1097	61.1	777	58.7
Nex-N2-mini	s-batman Q8_0	34.4 GiB	1083	54.5	n/a	n/a
Qwen3.6-35B-A3B	Unsloth UD-Q4_K_XL	20.8 GiB	1029	59.6	615	57.6
Ornith-1.0-35B	Q8_0	34.36 GiB	1027	53.2	725	52.5
Qwen3.5-35B-A3B	Unsloth UD-Q4_K_XL	21 GiB	1017	59.8	686	60.4
GLM-4.7-Flash	UD-Q4_K_XL	16.3 GiB	990	73	529	70
Qwen3.5-9B	UD-Q4_K_XL	5.6 GiB	972	35.7	289	35.2
Nemotron-Cascade-2-30B-A3B	Q8_0	31 GiB	968	54	n/a	n/a
Granite-4.1-8B	UD-Q4_K_XL	5.1 GiB	936	38.6	265	38.7
Kimi-Linear-48B-A3B	Q8_0	48.6 GiB	746	52	570	53
Gemma-4-12B	Unsloth UD-Q8_K_XL	13.6 GiB	716	14.0	199	14.0
Ministral-3-14B	UD-Q4_K_XL	7.8 GiB	696	25	173	25
GPT-OSS-120B	MXFP4	59 GiB	596	56	661	53
Qwen3-Coder-Next-80B	MXFP4	41 GiB	586	40	462	43
Magistral-Small-2509	UD-Q4_K_XL	13.5 GiB	389	15	94	15
Devstral-Small-2-24B	UD-Q4_K_XL	13.5 GiB	382	15	94	15
Mistral-Small-4-119B	Unsloth UD-Q4_K_XL	69 GiB	363	40.3	313	39.2
Qwen3.6-27B	Unsloth UD-Q4_K_XL	16.4 GiB	322	12.0	86	12.0
Qwen3.5-27B	UD-Q4_K_XL	16 GiB	310	12.1	86	11.9
Qwen3.5-122B-A10B	Unsloth UD-Q4_K_XL	72 GiB	287	22.4	197	21.9
Granite-4.1-30B	UD-Q4_K_XL	16.5 GiB	275	11.8	71.2	11.9
Gemma-4-31B	Unsloth UD-Q4_K_XL (Apr 11)	17.5 GiB	261	11.1	70.8	11.1
MiniMax-M2.7	Unsloth UD-IQ4_XS	101 GiB	181	27.0	179	24.8
MiniMax-M2.5	Unsloth UD-Q3_K_XL	94 GiB	179	22	164	32
Mistral-Medium-3.5-128B	Unsloth UD-Q4_K_XL	70.5 GiB	62	2.9	17	2.9
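
As a rough way to read the driver columns, the gap between RADV and AMDVLK on prompt processing can be expressed as a percentage. The sketch below uses a handful of rows copied from the table above; the function name and the row selection are illustrative and not part of the bench harness.

```python
# Rows copied from the Performance table above: (model, RADV pp, AMDVLK pp).
# The selection is illustrative, not the full table.
ROWS = [
    ("Gemma-4-E2B", 3382, 954),
    ("GPT-OSS-20B-Derestricted", 1405, 1380),
    ("Qwen3.5-9B", 972, 289),
    ("GPT-OSS-120B", 596, 661),
    ("Magistral-Small-2509", 389, 94),
]


def pp_speedup_percent(radv_pp: float, amdvlk_pp: float) -> float:
    """Return how much faster RADV is than AMDVLK on pp, in percent.

    A negative value means AMDVLK is the faster driver for that model.
    """
    return (radv_pp / amdvlk_pp - 1.0) * 100.0


for model, radv, amdvlk in ROWS:
    print(f"{model:28s} {pp_speedup_percent(radv, amdvlk):+7.1f}%")
```

A positive number means RADV is ahead. GPT-OSS-120B is the one row in this sample that comes out negative.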

Quality

The Tooling column is new and tracks automation reliability out of 65. It is scored separately and does not count toward the Combined total. Models marked with a hyphen have not run it yet. Scores here are for each model's reference quant. The 31B row keeps its Unsloth quant numbers as the stable reference, but the quant I actually keep on disk now is Google's QAT Q4_0, which matches these within noise on quality and pulls ahead on tooling, disk, and speed. Its full-suite breakdown and the whole QAT comparison run live on the quantization page.

Model	Writing /30	LRU /10	FastAPI /8	LeetCode /59	Polyglot /65	Postgres /57	Cassandra /56	Tooling /65	Combined /285
Gemma-4-31B	27	10	8	59	44	51	39	54.5	238
Ornith-1.0-35B	28	10	8	59	48	46	28	61	227
Gemma-4-12B	28	10	8	59	32	47	38	54	222
Gemma-4-26B-A4B	28	10	8	59	22	49	39	59	215
Qwen3.6-27B	30	10	8	59	30	44	32	60.5	213
Mistral-Medium-3.5-128B	30	7	8	59	30	39	34	-	207
Qwen3.6-35B-A3B	29	10	8	59	35	31	33	53.5	205
Qwen3.5-122B-A10B	29	10	8	59	13	36	37	-	192
Qwen3.5-35B-A3B	28	10	8	59	17	32	38	-	192
Kimi-Linear-48B-A3B	30	10	8	59	22	30	31	-	190
Qwen3.5-27B	25	10	8	59	21	34	29	-	186
MiniMax-M2.5	26	10	7	59	13	40	30	-	185
Nex-N2-mini	28	10	8	53	17	38	29	57	183
GPT-OSS-120B	20	10	8	59	14	40	31	-	182
MiniMax-M2.7	28	10	8	59	8	38	28	-	179
Gemma-4-E4B	23	7	8	59	13	35	34	-	179
Mistral-Small-4-119B	21	10	7	59	24	27	26	-	174
Qwen3-30B-Instruct-2507	30	10	2	59	13	27	31	-	172
Granite-4.1-30B	28	10	2	59	10	32	31	-	172
Qwen3-Coder-Next-80B	26	10	2	59	9	33	32	-	171
Devstral-Small-2-24B	27	10	2	59	11	29	31	-	169
GPT-OSS-20B-Derestricted	13	10	8	59	14	37	23	-	164
Gemma-4-E2B	18	7	8	59	10	27	27	-	156
Nemotron-3-Nano-Omni-30B-A3B-Reasoning	24	8	2	59	6	31	18	-	148
Ministral-3-14B	26	2	2	59	16	23	18	-	146
Qwen3.5-9B	20	10	0	51	14	28	23	-	146
Granite-4.1-8B	18	4	4	59	6	26	24	-	141
Nemotron-Cascade-2-30B-A3B	18	10	8	59	1	22	21	-	139
Qwen3.5-4B	16	9	8	54	3	17	16	-	123
Nemotron-3-Nano-30B-A3B	20	4	0	46	7	16	16	-	109
Magistral-Small-2509	20	0	8	30	2	12	35	-	107
Granite-4.1-3B	17	0	0	53	4	11	13	-	98
GLM-4.7-Flash	14	0	0	16	0	23	27	-	80
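
The Combined column can be reproduced from the other columns. A minimal sketch, assuming Combined is the plain sum of the seven scored columns with Tooling left out, as described above the table; the dictionary layout and function name are illustrative.

```python
from typing import Dict

# Column maxima from the Quality table header. Tooling (/65) is scored
# separately and is deliberately left out of the Combined total.
MAX_SCORES: Dict[str, int] = {
    "Writing": 30,
    "LRU": 10,
    "FastAPI": 8,
    "LeetCode": 59,
    "Polyglot": 65,
    "Postgres": 57,
    "Cassandra": 56,
}


def combined(scores: Dict[str, float]) -> float:
    """Sum the seven Combined columns, ignoring Tooling if it is present."""
    return sum(scores[name] for name in MAX_SCORES)


# Gemma-4-31B row from the table above.
gemma_4_31b = {
    "Writing": 27, "LRU": 10, "FastAPI": 8, "LeetCode": 59,
    "Polyglot": 44, "Postgres": 51, "Cassandra": 39, "Tooling": 54.5,
}

print(combined(gemma_4_31b), "/", sum(MAX_SCORES.values()))
```

For the Gemma-4-31B row this gives 238 out of 285, matching the table.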

Key Findings

RADV leads prompt processing on every model except GPT-OSS-120B. The margin runs from near parity (GPT-OSS-20B at 1405 vs 1380, MiniMax-M2.7 at 181 vs 179) to more than 4x (Magistral-Small at 389 vs 94). Token generation is typically tied between drivers since it's bandwidth-bound. AMDVLK only wins on GPT-OSS-120B pp (661 vs 596 RADV).
Kimi-Linear-48B is the new polyglot leader at 22/65, dethroning the GPT-OSS/Gemma trio (14/65). It's one of only two models to score on the Go rule engine (10/10) alongside Qwen3-30B (3/10). Linear attention gives it 72 t/s at 28 GiB.
Devstral-Small-2 punches above its weight. A 24B dense coding model hitting 29/57 Postgres and 31/56 Cassandra while scoring 10/10 LRU and 59/59 LeetCode. The 15 t/s generation speed hurts but the quality-per-parameter is impressive.
Nemotron-3-Nano collapses without thinking mode. 0/57 Postgres, 0/56 Cassandra, 0/59 LeetCode. The Mamba-2 hybrid architecture appears to need reasoning tokens enabled to produce structured output. Fast (1106 pp, 68 tg) but unusable for coding or database tasks with no-think.
Magistral-Small has the highest Cassandra score (35/56) of any non-Gemma model, beating Devstral (31) and Kimi (24). But it scores 0/10 LRU, 12/57 Postgres, and leaked its reasoning scaffold into the creative-writing test output. A specialist, not a generalist.
GLM-4.7-Flash can't code. 0/10 LRU, 0/8 FastAPI, 16/59 LeetCode (extraction failures), 0/65 polyglot. But it handles database work (23/57 PG, 27/56 Cass) and generates at 73 t/s. A fast model with a very narrow skill set.
Ornith-1.0-35B is DeepReinforce's agentic-coding RL fine-tune of Qwen3.5-35B-A3B, and it is the strongest result that base has produced on this board. Combined 227/285 puts it second overall behind Gemma-4-31B at 238, clearing its own base by 35 points and the other agentic tune Nex-N2-mini by 44. I ran it thinking-on, its native reasoning mode, the same way I ran Nex, so the 44-point gap reads cleanly as what the reinforcement training bought. The headline is polyglot, where 48/65 is the best local score on the board, four ahead of Gemma-4-31B. It got there by solving two challenges nothing local had cracked, the cron matcher at 10/10 and the sliding-window rate limiter at 9/10, with a perfect Go rule engine alongside. Only the Ruby Sinatra webhook stays at zero, where it writes real code with real logic bugs. Cassandra is the soft spot at 28/56, almost all of it in Tier 4 procedural CQL, the same syntax failures the Nex tune had. It also returned a perfect 22/22 on the hallucination calibration set, matching the base rather than beating it, and a board-topping 77/80 on the prose constraint test where it held first-person present tense with no stray colons, semicolons, or dashes. I added a --think flag to the bench harness to drive its always-on reasoning properly, and ran the suite at a 32k token budget so the long reasoning trace had room to finish before the answer.
MiniMax-M2.5 went from 0/10 coding to 10/10 after switching to Unsloth's UD-Q3_K_XL quant and their recommended sampling params (temp 1.0, top_p 0.95, min_p 0.01, top_k 40). At 94 GiB it's the largest model on disk, but 185/285 Combined puts it 6th overall. Perfect 10/10 on both PostgreSQL T2 optimization and Cassandra T2 anti-pattern detection. AMDVLK is its best backend at 32 t/s (vs 22 on RADV).
MiniMax-M2.7 is the 229B successor with a UD-IQ4_XS quant that squeezes into 101 GiB. Writing improved to 28/30 and all coding benchmarks max out (10/10 LRU, 59/59 LeetCode, 8/8 FastAPI). The model has an unusual no-think quirk. Its M2.5 predecessor required a custom Jinja template to suppress thinking. M2.7 reverses this entirely. The standard --reasoning-budget 0 flag works, but the custom template causes the model to reason inline in plain text without ever producing code. Combined 179/285 puts it just behind M2.5 at 185. PostgreSQL T2 optimization and T3 diagnosis are both strong (9/10 and a perfect 6/6). Cassandra T4 procedural challenges benefit from leaving thinking on (7/30 vs 2/30 without). RADV is the right backend here, unlike M2.5, which preferred AMDVLK.
Qwen3.6-35B-A3B is the hybrid-attention successor to the 3.5 MoE. Combined 205/285, writing 29/30, and polyglot 35/65 put it near the top, but the 35 polyglot score is best-of-5 rather than stable single-run behavior. Database regressions come from Tier 4 procedural SQL and CQL, while raw writing output appends internal planning notes even with enable_thinking:false.
Qwen3.6-27B is the dense companion to the 35B-A3B MoE. Combined 213/285 and writing 30/30 make it stronger on quality, while 12 tg makes it roughly five times slower than the MoE sibling. Polyglot averages 30/65 with Go variance, Postgres reaches 44/57, and the raw writing output has the same planning-note leak.
Granite-4.1-30B ties MiniMax-M2.7 for the highest non-Qwen writing score at 28/30 and lands Combined 172/285. It also hits 10/10 LRU, 59/59 LeetCode, and perfect Postgres T2, but FastAPI stays at 2/8 and generation is slow at 11.76 tg. The 8B is notable for a perfect 22/22 hallucination-calibration run, while the 3B remains a speed-first model with LRU and FastAPI failures.
Mistral-Medium-3.5-128B lands Combined 207/285 and writes 30/30, but coding needs the model's high-effort reasoning interface. Default no-reasoning LRU scored 2/10, while reasoning_effort: high raised it to 7/10 with a different expiry-map bug. ROCm is the right backend for prompt processing, and token generation stays around 3 t/s across all backends because the 125B dense workload is memory-bandwidth-bound.
Nemotron-3-Nano-Omni-30B-A3B-Reasoning adds multimodal training but does not improve the text-only writing score. Combined 148/285 puts it below the working field, and task-specific sampling is required. Instruct-mode params raise LRU from 0/10 to 8/10 but lower polyglot from 6/65 to 2/65, while FastAPI and Cassandra T4 still fail on endpoint design and CQL syntax.
Sampling params transform quality results. Unsloth-recommended params (presence_penalty 1.5, top_k 20, thinking mode via chat-template-kwargs) made the difference between 0/10 and 10/10 on several models.
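T2 (cron), T5 (FastAPI), and T6 (Sinatra) sat at 0/10 across all earlier model runs. Ornith-1.0-35B has since scored 10/10 on the cron matcher and leaves only the Sinatra webhook at zero, so T6 is the single-shot challenge that still exceeds current local model capability.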
Gemma 4 leads both database benchmarks. Dense 31B tops PostgreSQL (46/57) and Cassandra (39/56). The 26B MoE follows closely. No other model family breaks 35/57 PG or 35/56 Cass.
Mistral-Small-4-119B is the first Mistral mixture of experts on the board, 119B total with 6B active. At RADV it runs 363 pp and 40 tg, A6B-class generation for a 119B model and far quicker than the dense Mistral-Medium-3.5 at 3 tg. Combined 174/285 rests on strong pure coding (10/10 LRU, 59/59 LeetCode, 24/65 polyglot) against a mid-pack 21/30 on creative writing. Reasoning is binary, either none or high, and high is token-hungry. The first LRU run scored 0/10 because the reasoning trace filled the entire 4096-token budget before any code reached the response, so the suite runs at --max-tokens 20480. The creative-writing output ends with an assistant sign-off offering to revise the story, which the scorecard treats as a deduction.
Nex-N2-mini is an agentic fine-tune of Qwen3.5-35B-A3B that tracks its base closely. Combined 183/285 sits 9 points under the 192 the base records. Writing, LRU, FastAPI, and polyglot match cell for cell. Postgres climbs 6 to 38/57, but Cassandra drops 9 to 29/56 on Tier 4 CQL syntax errors, and LeetCode loses the heap-merge problem. The agentic training only really shows in the one column the base never ran, automation reliability, where it scores 57/65 at 98.5% usable across the 65 calls with a perfect 100% on the JSON-envelope tier. I left thinking on for the whole suite because the server used the model's own template, so these numbers reflect its native reasoning mode rather than a suppressed one. The disk quant is s-batman's Q8_0 at 34.4 GiB, the best documented of the community GGUFs since no Unsloth build exists.
The Cassandra JSON extractor now parses with strict=False, so anti-pattern answers that pretty-print multi-line CQL inside a JSON string value still count. Models often write a CREATE TABLE across several lines with literal newlines, which strict JSON rejects even when the answer is correct. The change raised Mistral-Small-4 Tier 2 from 5/10 to 7/10. Cassandra scores recorded before 2026-06-05 predate it and may understate any model that emitted the same multi-line JSON.
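
The strict=False behavior is easy to reproduce with Python's standard json module. This is a minimal sketch of the idea, not the extractor itself; the sample answer, the "fix" key, and the parse_answer helper are hypothetical.

```python
import json

# Hypothetical model answer: the CQL inside the JSON string value contains
# literal newlines rather than escaped "\n" sequences.
raw_answer = (
    '{"fix": "CREATE TABLE events (\n'
    "  id uuid PRIMARY KEY,\n"
    "  ts timestamp\n"
    ');"}'
)


def parse_answer(text: str, lenient: bool = True) -> dict:
    """Parse a JSON answer, optionally allowing control characters in strings."""
    return json.loads(text, strict=not lenient)


# Strict parsing rejects the literal newlines inside the string value.
try:
    parse_answer(raw_answer, lenient=False)
except json.JSONDecodeError as exc:
    print("strict parse rejected the answer:", exc.msg)

# Lenient parsing accepts the same text and keeps the line breaks.
parsed = parse_answer(raw_answer)
print(parsed["fix"].splitlines()[0])
```

With strict=False the decoder only relaxes the rule on control characters inside strings, so answers that are malformed in other ways still fail to parse.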

Partial Results

These models were evaluated before the full benchmark suite was established. They ran the writing and LRU cache tests but not the complete battery. They are no longer on disk and are listed here for historical reference.

Performance

Model	Quant	Size	RADV pp	RADV tg	AMDVLK pp	AMDVLK tg
Step3.5-Flash	IQ3_XS	76 GiB	237	32	n/a	n/a
Nemotron-3-Super-120B-A12B	Unsloth UD-Q4_K_XL	78 GiB	196	10.2	139	9.86

Quality

Model	Writing /30	LRU /10
Ling-Flash-2.0	26	2
Nemotron-3-Super-120B-A12B	25	10
Devstral-2-123B	25	2
Solar-Open-100B	21	0
Mistral-Large-2411	20	2
