Hardware & Software

CPUAMD Ryzen AI MAX+ 395 (16C/32T)
GPURadeon 8060S Graphics (RDNA 3.5, gfx1151)
Memory128GB unified (120GB soft VRAM cap)
Kernel6.17.0-29-generic
Vulkan1.4.321 (Mesa RADV 25.2.3 + AMDVLK 2025.Q2.1)
ROCmTheRock 7.13.0a20260408 nightly
llama.cppb9518 (Vulkan), b9172 (ROCm)

Backend Configuration

RADV

Mesa's open-source Vulkan driver. Best overall since b8119 MMQ fix with a 25% pp boost on Qwen3-30B. Recommended default for most models.

AMDVLK

AMD's open-source Vulkan driver. Best for GPT-OSS-120B prompt processing. Good stability for production use.

ROCm

AMD's compute stack via TheRock nightly. Requires -mmp 0 or models hang during load. Best for some dense models. Crashes with FA enabled on GPT-OSS-120B.

Common Parameters

ngl999 (all layers offloaded)
batch2048 (optimal for RDNA 3.5, higher values decrease performance)
threads32
flash_attn1 (enabled, except ROCm on GPT-OSS-120B)

Performance Benchmarks

All performance numbers come from llama-bench with pp512 (prompt processing, 512 tokens) and tg128 (token generation, 128 tokens). Each model is tested on all available backends with identical parameters. Results are reported in tokens per second (t/s).

Coding Benchmarks

All coding benchmarks are single-shot. The model receives one prompt and its output is automatically validated against a test suite. No retries, no iterative refinement.

PostgreSQL (57 tests, 4 tiers)

I built a PostgreSQL benchmark that spins up a fresh Postgres 18 container per run, seeds it with 2.7 million rows of deterministic e-commerce data (customers, orders, products, inventory, sessions, audit logs), and throws 28 challenges at the model across four tiers. The model talks to llama-cli on the Strix server. Its output gets extracted, executed against the live database, and validated automatically. Every run starts clean. The seed uses SELECT setseed(0.42) so the data is identical across runs.

Tier 1 tests whether a model can write complex SQL. Ten challenges covering window functions with monthly resets, recursive CTEs for category trees, gap detection with LAG/LEAD, top-N-per-group via ROW_NUMBER, LATERAL JOINs for recommendation queries, nested JSONB aggregation, GROUPING SETS reports, percentile analysis, array operations, and date range overlap detection. The model gets a schema description and a natural language requirement. Its SQL runs against the seeded database and the result set is compared row-by-row against a known-good query.

Tier 2 tests optimization reasoning. Ten challenges where the model receives a slow query, its EXPLAIN output, and the schema. It has to return a fix, either a CREATE INDEX statement or a rewritten query or both. The fix gets applied in a savepoint and verified two ways. First, the optimized query must return identical results to the original. Second, the EXPLAIN cost must actually drop. Cost comparison uses EXPLAIN (FORMAT JSON) total cost, which is deterministic, rather than wall-clock time. The first six challenges cover general SQL patterns (composite indexes, correlated subquery elimination, expression indexes, NOT IN to NOT EXISTS, window-to-GROUP-BY, correlated SELECT to JOIN). The last four test PostgreSQL-specific index knowledge (GIN for JSONB containment, partial indexes, covering indexes with INCLUDE, expression indexes on JSONB path extraction).

Tier 3 tests DBA diagnostic skills. Six challenges where the model receives simulated PostgreSQL diagnostic output (pg_stat_activity snapshots, EXPLAIN ANALYZE plans, pg_stat_user_tables data) and must return a JSON response identifying the root cause and suggesting a fix. Lock chain diagnosis, stale statistics, work_mem sort spills, connection exhaustion, table bloat, and failed partition pruning. Verification uses keyword matching against accepted causes and fix patterns. This tier turned out to be easy for 30B+ models. Both Gemma 4 models scored 6/6.

Tier 4 is where things get hard. Six procedural challenges where the model writes PL/pgSQL functions, triggers, DDL, and constraint definitions. Its code is executed against the live database and then validated through behavioral test cases (4 to 7 tests per challenge). The challenges cover exclusion constraints with tstzrange for scheduling conflicts, audit triggers that compute JSONB diffs of only the changed columns, partition management functions using dynamic DDL and pg_catalog introspection, row-level security policies for multi-tenant data isolation, recursive graph traversal with cycle detection, and batch processing procedures with savepoint-based error handling. Gemma-4-12B scored 26/31 on this tier, clearing exclusion constraints, audit triggers, partition management, cycle detection, and batch savepoints, and failing only on RLS policy creation (0/5).

TierFocusChallengesTestsVerification
T1Complex SQL Writing1010Result set comparison (row-by-row)
T2Query Optimization1010Same results + lower EXPLAIN cost
T3DBA Diagnosis66Root cause keyword matching
T4Procedural/Architecture631Execute code, run behavioral test cases

Cassandra (56 tests, 4 tiers)

I wanted to know whether models that handle PostgreSQL can also reason about distributed databases where the fundamental rules are different. So I built a Cassandra benchmark. A fresh 3-node Cassandra 5.0 cluster spins up on Strix per run, seeds 1.5 million rows of deterministic IoT sensor data across 8 tables, and throws 32 challenges at the model. The schema models a sensor platform with time-bucketed readings, denormalized query tables, counter tables, UDTs, static columns, and a deliberately broken table with unbounded partition growth. The seeder uses Python's random.seed(42) for reproducibility.

This benchmark checks whether the model understands Cassandra's data model instead of treating CQL as SQL syntax. Every partition key must appear in a WHERE clause. There are no JOINs. ORDER BY only works on clustering columns. GROUP BY only works on primary key prefixes in order. Deletes create tombstones that slow reads. Consistency is tunable per operation. INSERT and UPDATE are the same thing. A model trained on relational patterns will fail hard on these constraints even if it knows the CQL keyword syntax.

Tier 1 tests CQL query writing. Ten challenges covering composite partition key awareness, clustering column range slices with ORDER BY reversal, CONTAINS on set columns, map element access syntax, static column behavior, TTL and WRITETIME metadata functions, TOKEN() for distributed range scans, CQL GROUP BY with aggregation, JSON SELECT syntax, and PER PARTITION LIMIT for top-N-per-partition queries. The model gets a schema and writes CQL. Its output runs against the live cluster and the result set gets compared against a known-good query.

Tier 2 is where this benchmark diverges hardest from PostgreSQL. Cassandra has no EXPLAIN. There is no query optimizer to outsmart. Optimization means getting the data model right. Ten challenges where the model receives a schema, a query pattern, and symptoms of failure, then has to identify the anti-pattern and propose a fix. Unbounded partition growth, hot partitions, secondary index misuse on high-cardinality columns, tombstone accumulation from queue-like delete patterns, ALLOW FILTERING dependency, compaction strategy selection (STCS vs LCS vs TWCS vs UCS), consistency level mismatches with the R+W>RF formula, multi-partition batch abuse, materialized view pitfalls, and collection columns with millions of elements. Gemma-4-E2B scored 7/10 on this tier. A 2B model correctly identifying that a logged multi-partition batch is slower than individual async inserts, or explaining why a secondary index on a high-cardinality column causes scatter-gather to all nodes. These concepts don't require coding ability. They require understanding distributed systems.

Tier 3 tests operational diagnostics with six simulated nodetool and GC-log scenarios. The cases cover a DN node causing CL=ALL failures, a 2GB partition crashing compaction, dropped mutations with write timeout cascades, a 4.7-second G1 humongous-allocation pause, a compaction backlog that pushes read latency from 5ms to 2000ms, and a hinted-handoff backlog after a node exceeds the 3-hour hint window.

Tier 4 tests whether models can build working Cassandra objects against a live cluster. The six challenges cover a time-bucketed API request log table with TWCS compaction and 7-day TTL, a lightweight-transaction distributed lock, a materialized view with the correct IS NOT NULL constraints, a counter-based per-minute rate limiter, a LOGGED batch for dual-write denormalization, and SAI indexes for multi-column AND queries. Gemma-4-E2B got 10/30, scoring perfect 5/5 on both the LWT lock and counter rate limiter but failing the rest.

TierFocusChallengesTestsVerification
T1CQL Query Writing1010Result set comparison (row-by-row)
T2Anti-Pattern Detection1010Problem + fix keyword matching
T3Operational Diagnosis66Root cause keyword matching
T4Procedural/Architecture630Execute code, run behavioral test cases

LRU Cache with TTL (10 tests)

Implement a thread-safe LRU cache with TTL support in Python. Required methods are get(), put(), delete(), clear(), and size(). Tests cover basic get/put operations, capacity eviction, TTL expiration, concurrent read/write from multiple threads, and edge cases (empty cache, expired entries, delete nonexistent keys).

FastAPI TODO List (8 tests)

Generate a complete FastAPI REST API for a TODO list application. The prompt specifies POST /todos should accept {"title": "..."} without an explicit ID. This tests whether models use proper DTO patterns (separate create/update schemas) or fall back to a single model requiring ID in POST. Tests cover CRUD operations, error handling (404 on missing ID), and auto-incrementing IDs.

LeetCode Algorithms (59 tests)

Ten classic algorithmic problems spanning three difficulty tiers. Easy (Two Sum), Medium (Longest Substring Without Repeating Characters, 3Sum, Container With Most Water, Group Anagrams), and Hard (Trapping Rain Water, Merge K Sorted Lists, Longest Valid Parentheses, Median of Two Sorted Arrays, Minimum Window Substring). 59 test cases total. 10 of 12 models score a perfect 59/59. These problems are too saturated in training data to differentiate modern models.

Polyglot Coding (65 tests, 4 languages)

Seven challenges across Python, Bash, Ruby, and Go at LeetCode medium/hard difficulty. Designed to differentiate models that ace the simpler benchmarks above. Auto-graded with Python via unittest, Bash via output comparison, Ruby via minitest/rack-test, and Go via go test.

#ChallengeLanguageDifficultyTestsWhat It Tests
T1Structured Log ParserPythonMed-Hard10Regex with nested JSON in bracket-delimited fields
T2Cron Expression MatcherPythonHard10Parse 5-field cron syntax (ranges, steps, day names)
T3Apache Log AnalyzerBashMedium8awk/sort/uniq pipeline on combined log format
T4CSV Department AggregatorBashMed-Hard7Group-by with avg/max aggregation, formatted output
T5Sliding Window Rate LimiterPython/FastAPIHard10Per-route per-IP rate limiting with 429 + Retry-After
T6Webhook ProcessorRuby/SinatraHard10HMAC-SHA256 verification, idempotency, event routing
T7Boolean Rule EngineGoHard10Nested AND/OR/NOT rules with comparison operators

Creative Writing

Short stories (~2,000 words) are generated with specific narrative requirements and scored /30 by human evaluation across six quality dimensions including voice, pacing, sensory detail, dialogue, and emotional depth. Scoring is subjective but calibrated against the same rubric for every model.

Tooling / Automation Reliability (65 calls, 4 tiers)

Every other suite on this site asks how good a model's best answer is. This one asks how often the output is usable by a machine. I run local models inside unattended pipelines, a news digest that summarizes articles into strict JSON, scripts that categorize transactions against a fixed label list, extraction jobs that pull dates and amounts out of messy text. Those pipelines parse model output with code, never retry, and silently drop whatever fails to parse. A model that produces clean JSON 95 percent of the time still drops articles every night, and none of the quality benchmarks above would ever show it.

The benchmark runs 19 challenges as 65 separate calls. Each challenge runs three to five independent trials at temperature 0.7 with fixed seeds, so scores reflect repeated behavior rather than one lucky sample. Scoring is faithful to the consuming code. The JSON summarization challenges use the exact prompts from my own pipeline and parse responses the way the pipeline does, with a raw json.loads and no cleanup, so a response wrapped in markdown fences fails because the consuming code would drop it. One discovery-style challenge does get fence stripping, because its real parser has that recovery path. A trial scores 1.0, 0.5, or 0 depending on whether the output satisfies the full contract, only what the consumer would accept, or neither.

Tier 1 covers strict JSON envelopes. Clean articles, articles engineered to break JSON escaping (nested quotes in dialogue, literal braces in code excerpts, an embedded fenced block inside the article text, smart quotes and unicode, Windows paths full of backslashes), a 4,500-token long-input test repeated across five seeds, and a structured discovery task scored only on the fields a parser would consume. Tier 2 covers extraction and classification against exact ground truth. Log triage where WARN lines mention errors but only ERROR lines count, transaction categorization against ten fixed labels, entity extraction with date normalization, intent routing to a six-handler set, and receipt parsing where the line items have to add up to the printed total. Tier 3 covers plain-text format contracts. An email format checked by an exact port of the parser that consumes it, one-to-two sentence chat messages with no preamble, a 40-to-60 word paragraph budget, single-value outputs (a cron expression, a sed substitution, a semver bump), and an exact three-line status template. Tier 4 covers robustness. Summaries checked against curated fact lists so imported outside knowledge fails, articles with deliberately altered facts the model must report rather than correct, persona consistency under a heavy system prompt, harmless inputs that look alarming (a ransomware news story, security logs, a book blurb to classify by genre), and prompt injection embedded in the article text itself.

Each run reports the point score out of 65 plus reliability rates. Overall reliability is the share of calls the consuming parser would accept. Strict reliability is the share that parse with zero recovery. Tier 1 reliability is broken out separately because JSON summarization is the high-frequency workload in practice. Latency is reported per call but only comparable within a run. Every verifier has its own test suite, every fixture ships machine-checked ground truth, and every challenge has a round-trip test proving a perfect answer scores full marks. The tooling score is tracked in its own column and does not count toward the Combined total.

TierFocusChallengesCallsVerification
T1Strict JSON Envelopes420Production parser behavior, schema, constraints
T2Extraction & Classification515Exact match against ground truth
T3Format Contracts515Exact parser ports, templates, semantic value checks
T4Grounding & Robustness515Curated fact lists, banned phrases, injection markers

Sampling & Configuration Notes

llama-cli vs llama-server

Most benchmarks on this site run challenges through llama-cli in single-shot mode via SSH. The prompt is uploaded to the server, llama-cli loads the model, processes the prompt, generates a response, and exits. This means each challenge pays the full model load cost, and the raw output includes chat template framing that the extraction code has to strip.

I ran an experiment using llama-server with the OpenAI-compatible /v1/chat/completions endpoint instead. The model stays loaded between challenges, and responses come back as clean JSON with just the assistant message content. No SSH wrapping, no chat template artifacts in the output.

On the Cassandra benchmark with Qwen3.5-35B-A3B, the server approach scored 38/56 compared to 33/56 with llama-cli. The improvement came almost entirely from cleaner output extraction. T2 (anti-pattern detection) jumped from 6 to 9 and T3 (diagnosis) went from 5 to 6, both tiers where the model returns JSON that the extraction code was mangling. T4 (procedural) also improved from 13 to 16. The 32 challenges finished in about 2 minutes total compared to 40 minutes with llama-cli, since the model was already loaded.

I also tested whether giving the model access to a Cassandra reference wiki as tools (search + read) would improve scores. With tools available but optional, the model never used them. It was confident enough to answer every challenge directly. Forcing wiki consultation before every answer dropped the score to 24/56 because the multi-turn tool-call conversation disrupted the output format. For comparison, prepending wiki content directly into every prompt (RAG-style) scored 15/56, the worst of all approaches.

Database benchmark scores on this site now use the server approach where it produced higher scores. The model's knowledge did not change between runs. The improvement is purely from avoiding extraction artifacts in the llama-cli output pipeline.