A model is shown a design spec and a starting graph. It emits structured graph edits. A deterministic verifier grades the result: is the architecture structurally valid and connected, does it fit the parameter budget, and is every attention head configuration legal? No human judge, no LLM judge, no GPU. Every row below is reproducible with one command.
The harness is open source and has zero dependencies. The reference row replays a known-good solution per task (the ceiling: it proves every task is solvable).
git clone https://github.com/neurarch-ai/neurarch-arch-bench
cd neurarch-arch-bench
node leaderboard.mjs --providers=reference   # no API key needed
XAI_API_KEY=... node leaderboard.mjs --providers=grok
ANTHROPIC_API_KEY=... node leaderboard.mjs --providers=claude
GEMINI_API_KEY=... node leaderboard.mjs --providers=gemini
# contamination-resistant generated split, any size, seeded
node leaderboard.mjs --providers=grok --generate=50 --seed=7
The most common frontier-model failures are not exotic: attention head counts that do not divide the embedding dimension, linear layers whose input width does not match the upstream output, graphs where the input never reaches the output, and parameter budgets blown by an order of magnitude. Each failure is machine-checkable, which also makes this benchmark an RL environment: the repo ships a GRPO training loop against the same verifier.
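To make the four failure classes above concrete, here is a minimal sketch of how such checks could be written. It is not the harness's code (the harness itself runs on Node), and everything in it is an assumption made for illustration: the graph layout (a `nodes` dict plus an `edges` list), the node types and field names, the parameter-count formulas (weights plus bias per linear layer, four projections per attention layer), and the function names `verify` and `count_params`.

```python
from collections import deque

# Hypothetical graph format: {"nodes": {name: {...}}, "edges": [(src, dst), ...]}
# Node types and field names below are illustrative assumptions, not the harness schema.

def out_width(node):
    """Width a node produces, or None if it imposes no constraint."""
    return {"input": node.get("width"),
            "linear": node.get("out_features"),
            "attention": node.get("embed_dim")}.get(node["type"])

def in_width(node):
    """Width a node expects, or None if it accepts anything."""
    return {"linear": node.get("in_features"),
            "attention": node.get("embed_dim")}.get(node["type"])

def count_params(graph):
    """Assumed formulas: linear = in*out + out; attention = 4 projections of d*d + d."""
    total = 0
    for node in graph["nodes"].values():
        if node["type"] == "linear":
            total += node["in_features"] * node["out_features"] + node["out_features"]
        elif node["type"] == "attention":
            d = node["embed_dim"]
            total += 4 * (d * d + d)
    return total

def verify(graph, param_budget):
    """Return a list of failure messages; an empty list means the graph passes."""
    nodes, edges = graph["nodes"], graph["edges"]
    failures = []

    # 1. Attention head count must divide the embedding dimension.
    for name, node in nodes.items():
        if node["type"] == "attention":
            heads = node["num_heads"]
            if heads <= 0 or node["embed_dim"] % heads != 0:
                failures.append(f"{name}: {heads} heads do not divide {node['embed_dim']}")

    # 2. Each edge must connect matching widths.
    for src, dst in edges:
        produced, expected = out_width(nodes[src]), in_width(nodes[dst])
        if produced is not None and expected is not None and produced != expected:
            failures.append(f"{src}->{dst}: width {produced} != {expected}")

    # 3. Every output must be reachable from some input (breadth-first search).
    successors = {name: [] for name in nodes}
    for src, dst in edges:
        successors[src].append(dst)
    seen = {name for name, node in nodes.items() if node["type"] == "input"}
    queue = deque(seen)
    while queue:
        for nxt in successors[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    for name, node in nodes.items():
        if node["type"] == "output" and name not in seen:
            failures.append(f"{name}: not reachable from any input")

    # 4. Total parameter count must fit the budget.
    total = count_params(graph)
    if total > param_budget:
        failures.append(f"{total} parameters exceed budget {param_budget}")
    return failures

# A small graph that satisfies all four checks under a 20,000-parameter budget.
good = {
    "nodes": {
        "in": {"type": "input", "width": 64},
        "attn": {"type": "attention", "embed_dim": 64, "num_heads": 8},
        "fc": {"type": "linear", "in_features": 64, "out_features": 10},
        "out": {"type": "output"},
    },
    "edges": [("in", "attn"), ("attn", "fc"), ("fc", "out")],
}
print(verify(good, param_budget=20_000))
```

Because a check like this is deterministic and returns an explicit list of failures, the same function could in principle back a reward signal, for example 1.0 for an empty failure list and 0.0 otherwise. That mapping is an assumption here, not a description of the repo's GRPO loop.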
Built by Neurarch. Rows are added only from actual harness runs (the output JSON is checked in as leaderboard-data.json); no self-reported numbers.