AesSedai vs Unsloth (Qwen3.5 MoE)

For Qwen3.5 MoE models (35B-A3B, 122B-A10B), I tested quants from both AesSedai (MoE-aware quantization) and Unsloth (Dynamic 2.0). The differences are significant and measurable.

| Metric | AesSedai Q4_K_M | Unsloth UD-Q4_K_XL (fixed) |
|---|---|---|
| Strategy | Protects attention/shared experts at Q8_0, differentiates ffn_down from ffn_gate/up | Dynamic 2.0, upcasts critical layers, new imatrix (Mar 5) |
| 35B-A3B pp (RADV) | 832 t/s | 795 t/s |
| 35B-A3B tg (AMDVLK) | 53 t/s | 48 t/s |
| 35B-A3B Coding | 10/10 (158s) | 10/10 (228s, 148s thinking) |
| 35B-A3B Creative Writing | ~25-26/30 | ~28/30 (thinking) |
| 35B-A3B Size | 21 GiB | 19 GiB |
| 122B-A10B pp (RADV) | 259 t/s | 258 t/s |
| 122B-A10B Coding | 10/10 (415s) | 10/10 (517s, 422s thinking) |
| 122B-A10B Creative Writing | ~28/30 | ~29/30 (thinking) |

AesSedai wins on raw speed; Unsloth wins on quality with the correct params. AesSedai is ~5% faster at prompt processing (832 vs 795 t/s) and ~10% faster at token generation on the 35B, while the 122B pp numbers are a wash. With Unsloth-recommended sampling params and thinking mode, Unsloth UD-Q4_K_XL achieves ~28/30 creative writing (35B) and ~29/30 (122B), surpassing AesSedai's quality scores.

The Unsloth MXFP4 Bug

Critical bug found and fixed (2026-02-27). Unsloth's UD-Q4_K_XL was applying MXFP4 quantization to attention tensors by mistake. This caused coding to drop from 10/10 to 8/10 (TTL bugs from corrupted attention weights). The fix restored coding to 10/10 but creative writing quality remained slightly below AesSedai (~24 vs ~25-26/30).

Gemma-4-26B-A4B: Q4 vs Q8

I ran the full 7-benchmark suite on Gemma-4-26B-A4B at both UD-Q4_K_XL (16 GiB) and Q8_0 (25 GiB). The MoE architecture activates 3.8B parameters per token regardless of quant level, so I expected the differences to be negligible. They weren't.

Q8 improved creative writing by 5 points (23 to 28 out of 30). The Q4 run produced competent prose with decent character voices but relied on familiar beats. The Q8 run produced richer sensory detail, sharper dialogue, and a more specific emotional arc. The characters felt less like templates and more like people. That gap matters for a writing benchmark.

Polyglot coding gained 5 points (10 to 15 out of 65). The Q8 model solved challenges that Q4 couldn't, including two additional bash pipeline tasks. Both quants scored identically on LRU (10/10), FastAPI (8/8), and LeetCode (59/59), confirming that well-saturated coding benchmarks can't tell quants apart.

PostgreSQL moved from 44 to 45, a single point from T1 complex SQL. Cassandra dropped from 38 to 29. That 9-point regression came entirely from T4 procedural challenges (14 to 8), where the model writes CQL schemas and executes them against a live cluster. I suspect run-to-run variance rather than a real quality regression. These T4 challenges are sensitive to exact token sequences and a different code path can cascade into multiple test failures. One bad CREATE TABLE propagates through every subsequent test in that challenge.

The performance tradeoff is real. Token generation dropped 10% (52.9 to 47.7 t/s on RADV) while prompt processing actually improved 9% (1196 to 1303 t/s). Disk usage went from 16 to 25 GiB. For a model I run interactively, 48 t/s is still fast enough and the writing quality gain justified the switch. I deleted the Q4 from disk.

| Benchmark | UD-Q4_K_XL (16 GiB) | Q8_0 (25 GiB) | Delta |
|---|---|---|---|
| RADV pp/tg (t/s) | 1196 / 52.9 | 1303 / 47.7 | pp +9%, tg -10% |
| Creative Writing /30 | 23 | 28 | +5 |
| LRU Cache /10 | 10 | 10 | same |
| FastAPI /8 | 8 | 8 | same |
| LeetCode /59 | 59 | 59 | same |
| Polyglot /65 | 10 | 15 | +5 |
| PostgreSQL /57 | 44 | 45 | +1 |
| Cassandra /56 | 38 | 29 | -9 |
| Combined /285 | 192 | 194 | +2 |

Conventional wisdom says MoE models don't benefit from higher quants because the active parameter count stays the same. This result pushes back on that. The shared expert weights and attention layers run through every token, and they respond to precision. The improvement concentrated in creative writing and multi-language coding, both tasks where subtle weight differences affect output quality more than in structured benchmarks like LRU or LeetCode where the answer space is narrow.
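A toy experiment illustrates why always-on tensors respond to precision: round-to-nearest uniform quantization error scales with the step size, so 4-bit codes carry roughly 16x the RMS error of 8-bit codes on the same tensor, and that error is incurred on every token for shared experts and attention. This is a simplified stand-in for llama.cpp's block-wise K-quants (no blocks, no imatrix), run on a synthetic Gaussian weight tensor:

```python
import math, random

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # toy weight tensor

def quant_rms_error(ws, bits):
    # Symmetric round-to-nearest uniform quantizer: scale maps the largest
    # magnitude onto the top code, then each weight snaps to the nearest level.
    levels = 2 ** (bits - 1) - 1          # 7 codes for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in ws) / levels
    return math.sqrt(sum((w - round(w / scale) * scale) ** 2 for w in ws) / len(ws))

err4 = quant_rms_error(weights, 4)        # ~Q4-class precision
err8 = quant_rms_error(weights, 8)        # ~Q8-class precision
print(f"4-bit RMS error: {err4:.6f}, 8-bit RMS error: {err8:.6f}")
```

The error gap is the same whether a tensor is routed or shared; what differs is how often the degraded tensor participates, which is why quants like AesSedai's protect attention and shared experts at Q8_0.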

Kimi-Linear-48B: Q4 vs Q8

I also ran Kimi-Linear-48B-A3B at Q8_0 (49 GiB) against the primary Q4_K_M (28 GiB). The linear-attention architecture activates 3B parameters per token. The results were less interesting than Gemma's.

Writing quality was identical. Both runs scored 30/30 (the maximum). LRU stayed at 10/10. LeetCode improved from 57 to 59 (two more extraction successes, not model quality). Polyglot dropped from 22 to 15, likely from different runtime availability rather than quant regression. PostgreSQL gained 2 (26 to 28). Cassandra gained 1 (24 to 25). FastAPI dropped from 8 to 2, which I attribute to run variance.

Token generation took a 26% hit (72 to 53 t/s). Disk usage nearly doubled (28 to 49 GiB). The net Combined difference was zero after accounting for the API regression. I deleted the Q8 and kept Q4_K_M as the primary quant.

Two MoE models, two different outcomes. Gemma's Q8 showed genuine quality gains on creative tasks. Kimi's didn't. The difference might come down to how each architecture distributes work between shared and routed experts. Gemma 4 uses a larger shared expert pool that benefits from higher precision. Kimi's linear attention design may be more robust to quantization noise in its 3B active path.

MXFP4: QAT vs Post-Hoc

MXFP4 (Microscaling FP4) quantization behaves very differently depending on how it was applied.
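For reference, the OCP Microscaling FP4 layout can be sketched in a few lines: 32-element blocks share one power-of-two (E8M0) scale, and each element stores a 4-bit E2M1 code (sign plus one of eight magnitudes). This is a toy round-trip using the spec's nominal scale rule, not llama.cpp's actual kernel:

```python
import math

FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def nearest_fp4(v):
    # Snap a scaled value to the nearest representable E2M1 code.
    sign = -1.0 if v < 0 else 1.0
    return sign * min(FP4_VALUES, key=lambda c: abs(abs(v) - c))

def mxfp4_roundtrip(block):
    """Quantize a 32-element block to MXFP4 and dequantize it back."""
    amax = max(abs(x) for x in block)
    if amax == 0:
        return list(block)
    # Shared E8M0 scale: power of two placing amax near the top code (6.0);
    # emax for E2M1 is 2, hence the "- 2" from the MX spec's nominal rule.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    return [nearest_fp4(x / scale) * scale for x in block]

print(mxfp4_roundtrip([0.7, 1.0, -0.3, 0.1] + [0.0] * 28)[:4])
```

With only eight magnitudes per block, whether the weights were trained to live on that grid (QAT) or are snapped to it after the fact (post-hoc) makes exactly the difference described below.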

QAT-Trained MXFP4

Models trained with quantization-aware training (like GPT-OSS-120B) outperform standard quants. The model learned to compensate for reduced precision during training.

GPT-OSS-120B in MXFP4 runs at 661 t/s pp and 53 t/s tg. Excellent for a 59 GiB model.

Post-Hoc MXFP4

Applying MXFP4 to a BF16 model after training underperforms standard quants like Q4_K_M. The precision loss isn't compensated for.

This is what bit Qwen3.5 MoE: MXFP4 mistakenly applied to attention tensors corrupted the weights until the fix described above.

Qwen3.5 Dense Architecture Bottleneck

Qwen3.5 dense models (4B, 9B, 27B) exhibit a token generation speed bottleneck that is consistent across all backends and quants. This appears to be baked into the architecture itself rather than caused by quantization or driver choice.

| Model | Size | tg (RADV) | tg (AMDVLK) | tg (ROCm) | Expected tg |
|---|---|---|---|---|---|
| Qwen3.5-4B Q8_0 | 4 GiB | 37.9 | 40.0 | 36.3 | ~80+ |
| Qwen3.5-9B Q8_0 | 9 GiB | 22.5 | 23.3 | 22.3 | ~60+ |
| Qwen3.5-27B UD-Q4_K_XL | 16 GiB | 11.35 | 11.90 | 11.28 | ~40+ |

The tg speeds barely vary across backends, typically within 1-2 t/s of each other. Compare Qwen3-30B (MoE), which gets 75 t/s on RADV at a similar model size. The bottleneck is in how the dense Qwen3.5 architecture interacts with the inference engine, not in raw compute throughput.
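One way to sanity-check that claim: decode on a dense model is normally memory-bound, so tg multiplied by the weight bytes streamed per token (~model size, ignoring KV cache) approximates the effective bandwidth actually achieved. If the implied figures sit far below the card's spec and barely move across backends, the stall is somewhere other than raw throughput. A back-of-envelope using the table's RADV numbers:

```python
# (size in GiB, observed tg in t/s) from the table above
runs = {
    "Qwen3.5-4B Q8_0": (4, 37.9),
    "Qwen3.5-9B Q8_0": (9, 22.5),
    "Qwen3.5-27B UD-Q4_K_XL": (16, 11.35),
}

for name, (size_gib, tg) in runs.items():
    # Effective bandwidth if every weight byte is read once per token.
    eff_bw = size_gib * tg
    print(f"{name}: ~{eff_bw:.0f} GiB/s effective weight-streaming bandwidth")
```

The per-model figures land in a similar band well under what the Expected tg column implies the hardware can sustain, consistent with an architecture/engine interaction rather than a compute or bandwidth ceiling.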

Impact on Usability

At 11-12 t/s, the 27B falls below comfortable interactive speed, and even the 4B generates at roughly half the rate its size class should deliver.

Qwen3.5 Sampling Parameters (Critical)

Qwen3.5 models require the specific sampling parameters recommended by Unsloth. Default llama.cpp params cause dramatic failures: 0/10 coding on the 27B, prose loops on the 35B, and excessive repetition across all sizes.

| Task | Mode | temp | top_p | top_k | presence_penalty | repeat_penalty |
|---|---|---|---|---|---|---|
| Coding | no-think | 0.6 | 0.95 | 20 | 0.0 | 1.0 |
| Creative/General | no-think | 0.7 | 0.8 | 20 | 1.5 | 1.0 |
| Coding | thinking | 0.6 | 0.95 | 20 | 0.0 | 1.0 |
| General | thinking | 1.0 | 0.95 | 20 | 1.5 | 1.0 |

Use --chat-template-kwargs '{"enable_thinking":false}' and NOT --reasoning-budget 0, which produces zero output. presence_penalty=1.5 is the single most impactful param: it eliminated the 35B prose loop and halved the colon count on the 27B. It hurts 122B prose, though (5 vs 15 colons).
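The table translates into llama-server invocations roughly as follows. Flag names match recent llama.cpp builds; the model path is a placeholder, and the (task, mode) keys are my own labels for the table rows:

```python
# Unsloth-recommended sampler settings, keyed by (task, mode):
# (temp, top_p, top_k, presence_penalty, repeat_penalty)
PARAMS = {
    ("coding", "no-think"):   (0.6, 0.95, 20, 0.0, 1.0),
    ("creative", "no-think"): (0.7, 0.80, 20, 1.5, 1.0),
    ("coding", "thinking"):   (0.6, 0.95, 20, 0.0, 1.0),
    ("general", "thinking"):  (1.0, 0.95, 20, 1.5, 1.0),
}

def server_args(task, mode, model="Qwen3.5-35B-A3B.gguf"):  # placeholder path
    temp, top_p, top_k, pres, rep = PARAMS[(task, mode)]
    args = ["llama-server", "-m", model,
            "--temp", str(temp), "--top-p", str(top_p), "--top-k", str(top_k),
            "--presence-penalty", str(pres), "--repeat-penalty", str(rep)]
    if mode == "no-think":
        # Disable thinking via template kwargs, NOT --reasoning-budget 0.
        args += ["--chat-template-kwargs", '{"enable_thinking":false}']
    return args

print(" ".join(server_args("creative", "no-think")))
```

Swapping the (task, mode) key is all it takes to move between the coding and creative presets.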

Qwen3.5-35B-A3B Prose Loop (Resolved)

Previously reported as an "architecture-level issue" affecting all quants. Root cause was missing sampling parameters. With Unsloth-recommended presence_penalty=1.5, the loop is completely eliminated.

| Configuration | Loop? | Words | Colons |
|---|---|---|---|
| Default params (all quants) | Yes | 19-21k | varies |
| Unsloth params, no-think | No | 3,474 | 5 |
| Unsloth params, thinking | No | 5,121 | 11 |

It wasn't an architecture issue after all: the model needs presence_penalty=1.5 to prevent repetition degeneration. No-think mode produces the best prose adherence (5 colons, 0 semicolons); thinking mode overshoots the word count but doesn't loop.
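The loop diagnosis above (19-21k words of degenerate prose vs ~3.5k healthy) is easy to automate with a crude n-gram repetition score; a minimal sketch, not the metric used for the benchmark numbers:

```python
def ngram_repetition(text, n=8):
    """Fraction of word n-grams that are repeats: near 0 for healthy prose,
    approaching 1 when generation collapses into a loop."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

looping = "she walked to the door and paused " * 50   # degenerate output
varied = " ".join(f"word{i}" for i in range(400))      # no repeated n-grams
print(f"loop score: {ngram_repetition(looping):.2f}, "
      f"healthy score: {ngram_repetition(varied):.2f}")
```

A threshold around 0.5 cleanly separates the two regimes here; real benchmark output would sit between these synthetic extremes.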