Methodology
Hardware, software, backend configuration, and benchmark details. Coding benchmarks are fully specified below. Creative writing evaluation is scored by human review.
Hardware & Software
| Component | Detail |
|---|---|
| CPU | AMD Ryzen AI MAX+ 395 (16C/32T) |
| GPU | Radeon 8060S Graphics (RDNA 3.5, gfx1151) |
| Memory | 128GB unified (120GB soft VRAM cap) |
| Kernel | 6.17.0-8-generic |
| Vulkan | 1.4.321 (Mesa RADV 25.2.3 + AMDVLK 2025.Q2.1) |
| ROCm | TheRock 7.11.0a20260121 nightly |
| llama.cpp | b8638 (2026-04-02) |
Backend Configuration
RADV
Mesa's open-source Vulkan driver. Best overall since the b8119 MMQ fix, which brought a 25% pp boost on Qwen3-30B. Recommended default for most models.
AMDVLK
AMD's open-source Vulkan driver. Best for GPT-OSS-120B prompt processing. Good stability for production use.
ROCm
AMD's compute stack via TheRock nightly. Requires -mmp 0 (mmap disabled) or models hang during load. Best for some dense models. Crashes with flash attention enabled on GPT-OSS-120B.
Common Parameters
| Parameter | Value |
|---|---|
| ngl | 999 (all layers offloaded) |
| batch | 2048 (optimal for RDNA 3.5, higher values decrease performance) |
| threads | 32 |
| flash_attn | 1 (enabled, except ROCm on GPT-OSS-120B) |
Performance Benchmarks
All performance numbers come from llama-bench with pp512 (prompt processing, 512 tokens) and tg128 (token generation, 128 tokens). Each model is tested on all available backends with identical parameters. Results are reported in tokens per second (t/s).
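Under the common parameters above, a representative invocation looks roughly like this (the model path is a placeholder; flags follow llama-bench's standard options):

```shell
# Sketch of the benchmark invocation; model path is illustrative.
llama-bench -m /models/qwen3-30b-q4_k_m.gguf \
  -p 512 -n 128 \
  -ngl 999 -b 2048 -t 32 -fa 1
```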
Coding Benchmarks
All coding benchmarks are single-shot. The model receives one prompt and its output is automatically validated against a test suite. No retries, no iterative refinement.
PostgreSQL (57 tests, 4 tiers)
I built a PostgreSQL benchmark that spins up a fresh Postgres 18 container per run, seeds it with 2.7 million rows of deterministic e-commerce data (customers, orders, products, inventory, sessions, audit logs), and throws 32 challenges at the model across four tiers. Prompts are served through llama-cli on the Strix server; the model's output gets extracted, executed against the live database, and validated automatically. Every run starts clean. The seed uses SELECT setseed(0.42) so the data is identical across runs.
Tier 1 tests whether a model can write complex SQL. Ten challenges covering window functions with monthly resets, recursive CTEs for category trees, gap detection with LAG/LEAD, top-N-per-group via ROW_NUMBER, LATERAL JOINs for recommendation queries, nested JSONB aggregation, GROUPING SETS reports, percentile analysis, array operations, and date range overlap detection. The model gets a schema description and a natural language requirement. Its SQL runs against the seeded database and the result set is compared row-by-row against a known-good query.
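The row-by-row check can be sketched in a few lines of Python. This is a simplified stand-in for the harness (the real one runs both queries against live Postgres), assuming rows arrive as tuples:

```python
def rows_match(candidate, reference, ordered=True):
    """Compare two result sets row by row.

    When a challenge doesn't pin an ORDER BY, compare as multisets
    instead of sequences so row order doesn't matter.
    """
    if ordered:
        return list(candidate) == list(reference)
    return sorted(map(repr, candidate)) == sorted(map(repr, reference))

# Toy illustration: model output vs. known-good rows.
good = [("Books", 120), ("Games", 95)]
assert rows_match([("Books", 120), ("Games", 95)], good)
assert not rows_match([("Games", 95), ("Books", 120)], good)
assert rows_match([("Games", 95), ("Books", 120)], good, ordered=False)
```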
Tier 2 tests optimization reasoning. Ten challenges where the model receives a slow query, its EXPLAIN output, and the schema. It has to return a fix, either a CREATE INDEX statement or a rewritten query or both. The fix gets applied in a savepoint and verified two ways. First, the optimized query must return identical results to the original. Second, the EXPLAIN cost must actually drop. Cost comparison uses EXPLAIN (FORMAT JSON) total cost, which is deterministic, rather than wall-clock time. The first six challenges cover general SQL patterns (composite indexes, correlated subquery elimination, expression indexes, NOT IN to NOT EXISTS, window-to-GROUP-BY, correlated SELECT to JOIN). The last four test PostgreSQL-specific index knowledge (GIN for JSONB containment, partial indexes, covering indexes with INCLUDE, expression indexes on JSONB path extraction).
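The deterministic cost check reduces to pulling the planner's total cost out of the JSON plan. A sketch, where the two JSON strings are hand-made stand-ins for real EXPLAIN (FORMAT JSON) output:

```python
import json

def total_cost(explain_json: str) -> float:
    """Extract the planner's total cost from EXPLAIN (FORMAT JSON) output.
    Postgres returns a one-element array whose first entry holds the Plan."""
    return json.loads(explain_json)[0]["Plan"]["Total Cost"]

# Hand-made stand-ins for plans before and after an index is added.
before = '[{"Plan": {"Node Type": "Seq Scan", "Total Cost": 8423.50}}]'
after  = '[{"Plan": {"Node Type": "Index Scan", "Total Cost": 112.75}}]'

assert total_cost(after) < total_cost(before)  # the fix must lower cost
```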
Tier 3 tests DBA diagnostic skills. Six challenges where the model receives simulated PostgreSQL diagnostic output (pg_stat_activity snapshots, EXPLAIN ANALYZE plans, pg_stat_user_tables data) and must return a JSON response identifying the root cause and suggesting a fix. Lock chain diagnosis, stale statistics, work_mem sort spills, connection exhaustion, table bloat, and failed partition pruning. Verification uses keyword matching against accepted causes and fix patterns. This tier turned out to be easy for 30B+ models. Both Gemma 4 models scored 6/6.
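The keyword verification is deliberately simple. A sketch of the idea, with hypothetical accepted-answer lists for the stale-statistics challenge:

```python
def diagnosis_passes(response: dict, accepted_causes, accepted_fixes) -> bool:
    """Keyword-match a model's JSON diagnosis against accepted answers.
    Passes if any accepted cause phrase AND any accepted fix phrase appear."""
    cause = response.get("root_cause", "").lower()
    fix = response.get("fix", "").lower()
    return (any(k in cause for k in accepted_causes)
            and any(k in fix for k in accepted_fixes))

# Hypothetical keyword lists for the stale-statistics challenge.
causes = ["stale statistics", "outdated statistics", "analyze"]
fixes = ["analyze", "autovacuum"]
assert diagnosis_passes(
    {"root_cause": "Stale statistics mislead the planner",
     "fix": "Run ANALYZE on the orders table"},
    causes, fixes)
```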
Tier 4 is where things get hard. Six procedural challenges where the model writes PL/pgSQL functions, triggers, DDL, and constraint definitions. Its code is executed against the live database and then validated through behavioral test cases (4 to 7 tests per challenge). The challenges cover exclusion constraints with tstzrange for scheduling conflicts, audit triggers that compute JSONB diffs of only the changed columns, partition management functions using dynamic DDL and pg_catalog introspection, row-level security policies for multi-tenant data isolation, recursive graph traversal with cycle detection, and batch processing procedures with savepoint-based error handling. Gemma-4-31B scored 22/31 on this tier, acing four challenges but failing completely on RLS policy creation (0/5) and cycle detection (0/4).
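The cycle-detection challenge that zeroed Gemma-4-31B boils down to recursive traversal with a visited-path check. In Python rather than PL/pgSQL, the core logic is roughly:

```python
def find_cycle(edges):
    """Detect a cycle in a directed graph given as (parent, child) pairs.
    Mirrors the logic the PL/pgSQL challenge asks for: walk each path
    and fail as soon as a node reappears on the current path."""
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, []).append(child)

    def walk(node, path):
        if node in path:
            return path[path.index(node):] + [node]  # the cycle itself
        for nxt in graph.get(node, []):
            found = walk(nxt, path + [node])
            if found:
                return found
        return None

    for start in graph:
        cycle = walk(start, [])
        if cycle:
            return cycle
    return None

assert find_cycle([(1, 2), (2, 3), (3, 1)]) == [1, 2, 3, 1]
assert find_cycle([(1, 2), (2, 3)]) is None
```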
| Tier | Focus | Challenges | Tests | Verification |
|---|---|---|---|---|
| T1 | Complex SQL Writing | 10 | 10 | Result set comparison (row-by-row) |
| T2 | Query Optimization | 10 | 10 | Same results + lower EXPLAIN cost |
| T3 | DBA Diagnosis | 6 | 6 | Root cause keyword matching |
| T4 | Procedural/Architecture | 6 | 31 | Execute code, run behavioral test cases |
Cassandra (56 tests, 4 tiers)
I wanted to know whether models that handle PostgreSQL can also reason about distributed databases where the fundamental rules are different. So I built a Cassandra benchmark. A fresh 3-node Cassandra 5.0 cluster spins up on Strix per run, seeds 1.5 million rows of deterministic IoT sensor data across 8 tables, and throws 32 challenges at the model. The schema models a sensor platform with time-bucketed readings, denormalized query tables, counter tables, UDTs, static columns, and a deliberately broken table with unbounded partition growth. The seeder uses Python's random.seed(42) for reproducibility.
The core question is whether the model understands that Cassandra is not SQL with different syntax. Every partition key must appear in a WHERE clause. There are no JOINs. ORDER BY only works on clustering columns. GROUP BY only works on primary key prefixes in order. Deletes create tombstones that slow reads. Consistency is tunable per operation. INSERT and UPDATE are the same thing. A model trained on relational patterns will fail hard on these constraints even if it knows the CQL keyword syntax.
Tier 1 tests CQL query writing. Ten challenges covering composite partition key awareness, clustering column range slices with ORDER BY reversal, CONTAINS on set columns, map element access syntax, static column behavior, TTL and WRITETIME metadata functions, TOKEN() for distributed range scans, CQL GROUP BY with aggregation, JSON SELECT syntax, and PER PARTITION LIMIT for top-N-per-partition queries. The model gets a schema and writes CQL. Its output runs against the live cluster and the result set gets compared against a known-good query.
Tier 2 is where this benchmark diverges hardest from PostgreSQL. Cassandra has no EXPLAIN. There is no query optimizer to outsmart. Optimization means getting the data model right. Ten challenges where the model receives a schema, a query pattern, and symptoms of failure, then has to identify the anti-pattern and propose a fix. Unbounded partition growth, hot partitions, secondary index misuse on high-cardinality columns, tombstone accumulation from queue-like delete patterns, ALLOW FILTERING dependency, compaction strategy selection (STCS vs LCS vs TWCS vs UCS), consistency level mismatches with the R+W>RF formula, multi-partition batch abuse, materialized view pitfalls, and collection columns with millions of elements. Gemma-4-E2B scored 7/10 on this tier, correctly identifying that a logged multi-partition batch is slower than individual async inserts and explaining why a secondary index on a high-cardinality column causes scatter-gather to all nodes. These concepts don't require coding ability; they require understanding distributed systems.
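The R+W>RF rule from the consistency challenges is just arithmetic: a read is guaranteed to see the latest write when the read and write replica sets must overlap. A one-function sketch:

```python
def strongly_consistent(r: int, w: int, rf: int) -> bool:
    """Cassandra's tunable-consistency rule: reads see the latest write
    when read replicas + write replicas exceed the replication factor."""
    return r + w > rf

# RF=3: QUORUM reads + QUORUM writes (2 + 2) overlap; ONE + ONE does not.
assert strongly_consistent(2, 2, 3)
assert not strongly_consistent(1, 1, 3)
```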
Tier 3 tests operational diagnostics. Six challenges with simulated nodetool and GC log output. A DN node in nodetool status with CL=ALL causing NoHostAvailable exceptions. A 2GB partition in nodetool tablestats crashing compaction. Dropped mutations in nodetool tpstats with write timeout cascades. A 4.7-second G1 humongous allocation GC pause causing gossip pauses. A compaction backlog of 847 pending tasks with read latency spiking from 5ms to 2000ms. And a hinted handoff backlog of 847K hints after a node exceeded the 3-hour hint window during 4 hours of maintenance downtime.
Tier 4 tests whether models can build working Cassandra objects against a live cluster. Six challenges with multiple behavioral test cases each. Design a time-bucketed API request log table with TWCS compaction and 7-day TTL (6 tests). Implement a distributed lock using lightweight transactions with IF NOT EXISTS and TTL-based auto-expiry (5 tests). Create a materialized view with the correct IS NOT NULL constraints and verify data propagation (5 tests). Build a counter-based rate limiter that tracks requests per API key per minute bucket (5 tests). Write a LOGGED batch for atomic dual-write denormalization across two query tables (4 tests). Create SAI indexes on multiple columns and verify multi-column AND queries work without ALLOW FILTERING (5 tests). Gemma-4-E2B got 10/30, scoring perfect 5/5 on both the LWT lock and counter rate limiter but failing the rest.
| Tier | Focus | Challenges | Tests | Verification |
|---|---|---|---|---|
| T1 | CQL Query Writing | 10 | 10 | Result set comparison (row-by-row) |
| T2 | Anti-Pattern Detection | 10 | 10 | Problem + fix keyword matching |
| T3 | Operational Diagnosis | 6 | 6 | Root cause keyword matching |
| T4 | Procedural/Architecture | 6 | 30 | Execute code, run behavioral test cases |
LRU Cache with TTL (10 tests)
Implement a thread-safe LRU cache with TTL support in Python. Required methods are get(), put(), delete(), clear(), and size(). Tests cover basic get/put operations, capacity eviction, TTL expiration, concurrent read/write from multiple threads, and edge cases (empty cache, expired entries, delete nonexistent keys).
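A minimal single-threaded sketch of the required semantics (the benchmark additionally demands thread safety, which wrapping each method in a threading.Lock would provide):

```python
import time
from collections import OrderedDict

class LRUCache:
    """Sketch of the benchmark's required semantics: LRU eviction at
    capacity plus per-entry TTL expiry. Not the graded solution."""
    def __init__(self, capacity: int, ttl: float):
        self.capacity, self.ttl = capacity, ttl
        self._data = OrderedDict()  # key -> (value, expiry timestamp)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if time.monotonic() >= expires:
            del self._data[key]      # lazily drop expired entries
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (value, time.monotonic() + self.ttl)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

    def delete(self, key):
        self._data.pop(key, None)

    def clear(self):
        self._data.clear()

    def size(self):
        return len(self._data)

cache = LRUCache(capacity=2, ttl=60.0)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")     # touch "a" so "b" is now least recently used
cache.put("c", 3)  # evicts "b"
assert cache.get("b") is None and cache.get("a") == 1 and cache.size() == 2
```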
FastAPI TODO List (8 tests)
Generate a complete FastAPI REST API for a TODO list application. The prompt specifies POST /todos should accept {"title": "..."} without an explicit ID. This tests whether models use proper DTO patterns (separate create/update schemas) or fall back to a single model requiring ID in POST. Tests cover CRUD operations, error handling (404 on missing ID), and auto-incrementing IDs.
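The DTO distinction the prompt probes for, sketched with dataclasses rather than the Pydantic models a real FastAPI solution would use: the create schema has no ID, the stored model does, and the server assigns it.

```python
from dataclasses import dataclass

@dataclass
class TodoCreate:   # what POST /todos accepts: no client-supplied ID
    title: str

@dataclass
class Todo:         # what the API stores and returns
    id: int
    title: str
    done: bool = False

_next_id = 0
def create_todo(payload: TodoCreate) -> Todo:
    """Assign an auto-incrementing server-side ID on creation."""
    global _next_id
    _next_id += 1
    return Todo(id=_next_id, title=payload.title)

first = create_todo(TodoCreate(title="buy milk"))
second = create_todo(TodoCreate(title="write tests"))
assert (first.id, second.id) == (1, 2)
```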
LeetCode Algorithms (59 tests)
Ten classic algorithmic problems spanning three difficulty tiers. Easy (Two Sum), Medium (Longest Substring Without Repeating Characters, 3Sum, Container With Most Water, Group Anagrams), and Hard (Trapping Rain Water, Merge K Sorted Lists, Longest Valid Parentheses, Median of Two Sorted Arrays, Minimum Window Substring). 59 test cases total. 10 of 12 models score a perfect 59/59. These problems are too saturated in training data to differentiate modern models.
Polyglot Coding (65 tests, 4 languages)
Seven challenges across Python, Bash, Ruby, and Go at LeetCode medium/hard difficulty. Designed to differentiate models that ace the simpler benchmarks above. Auto-graded with Python via unittest, Bash via output comparison, Ruby via minitest/rack-test, and Go via go test.
| # | Challenge | Language | Difficulty | Tests | What It Tests |
|---|---|---|---|---|---|
| T1 | Structured Log Parser | Python | Med-Hard | 10 | Regex with nested JSON in bracket-delimited fields |
| T2 | Cron Expression Matcher | Python | Hard | 10 | Parse 5-field cron syntax (ranges, steps, day names) |
| T3 | Apache Log Analyzer | Bash | Medium | 8 | awk/sort/uniq pipeline on combined log format |
| T4 | CSV Department Aggregator | Bash | Med-Hard | 7 | Group-by with avg/max aggregation, formatted output |
| T5 | Sliding Window Rate Limiter | Python/FastAPI | Hard | 10 | Per-route per-IP rate limiting with 429 + Retry-After |
| T6 | Webhook Processor | Ruby/Sinatra | Hard | 10 | HMAC-SHA256 verification, idempotency, event routing |
| T7 | Boolean Rule Engine | Go | Hard | 10 | Nested AND/OR/NOT rules with comparison operators |
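The HMAC verification at the heart of T6, shown in Python for consistency with the other sketches (the graded solution is Ruby/Sinatra); the secret and body are illustrative:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Core of the T6 webhook check: recompute HMAC-SHA256 over the raw
    request body and compare in constant time. Header extraction and
    idempotency handling are left to the server framework."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

secret = b"illustrative-secret"
body = b'{"event":"order.paid","id":"evt_1"}'
good = hmac.new(secret, body, hashlib.sha256).hexdigest()
assert verify_signature(secret, body, good)
assert not verify_signature(secret, b"tampered-body", good)
```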
Creative Writing
Short stories (~2,000 words) are generated with specific narrative requirements and scored /30 by human evaluation across six quality dimensions including voice, pacing, sensory detail, dialogue, and emotional depth. Scoring is subjective but calibrated against the same rubric for every model.
Sampling & Configuration Notes
- Unsloth-recommended params transformed results for Qwen3.5 models. Use `presence_penalty=1.5`, `top_k=20`, and thinking mode via `--chat-template-kwargs '{"enable_thinking":true}'` with 64k context.
- Thinking mode helps coding but hurts prose style. Models plan implementations more effectively but overshoot word counts and introduce more punctuation violations in creative writing.
- `--reasoning-budget 0` does not work for disabling Qwen3.5 thinking. Use `--chat-template-kwargs '{"enable_thinking":false}'` instead.
- Qwen3.5-27B went from 0/10 to 10/10 on coding just by switching from default params to Unsloth-recommended sampling.