Can frontier models design valid neural networks?

A model is shown a design spec and a starting graph. It emits structured graph edits. A deterministic verifier grades the result: is the architecture structurally valid and connected, does it stay within the parameter budget, and are its attention head configurations legal? No human judge, no LLM judge, no GPU. Every row below is reproducible with one command.

verifiable reward · 12 curated + unlimited generated tasks · pass^k reliability · sub-ms grading · RL-trainable (GRPO)
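For readers new to pass^k: this page does not spell out the formula, so the sketch below assumes the common definition, the probability that all k attempts at a task pass. It uses the usual combinatorial estimator over n recorded attempts with c successes. The function names are hypothetical, and whether the harness computes the metric exactly this way is an assumption.

```python
from math import comb

def pass_hat_k(successes, trials, k):
    """Estimate the chance that k attempts, drawn without replacement
    from `trials` recorded attempts, all succeed (assumed definition)."""
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    # comb(successes, k) is 0 when successes < k, so the estimate is 0 there.
    return comb(successes, k) / comb(trials, k)

def benchmark_pass_hat_k(per_task, k):
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in per_task) / len(per_task)

# Three hypothetical tasks with 4 attempts each: always, mostly, never solved.
example = [(4, 4), (3, 4), (0, 4)]
score_k2 = benchmark_pass_hat_k(example, 2)
```

Unlike pass@k, which rewards any single success, this quantity falls as k grows unless the model solves a task every time, which is why it reads as a reliability measure.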

Reproduce any row

The harness is open source and has zero dependencies. The reference row replays a known-good solution for each task. It is the ceiling: it proves every task is solvable.

git clone https://github.com/neurarch-ai/neurarch-arch-bench
cd neurarch-arch-bench

node leaderboard.mjs --providers=reference            # no API key needed
XAI_API_KEY=...       node leaderboard.mjs --providers=grok
ANTHROPIC_API_KEY=... node leaderboard.mjs --providers=claude
GEMINI_API_KEY=...    node leaderboard.mjs --providers=gemini

# contamination-resistant generated split, any size, seeded
node leaderboard.mjs --providers=grok --generate=50 --seed=7

Why the failures are interesting

The most common frontier-model failures are not exotic: attention head counts that do not divide the embedding dimension, linear layers whose input width does not match the upstream output, graphs where the input never reaches the output, and parameter budgets blown by an order of magnitude. Each failure is machine-checkable, which also makes this benchmark an RL environment: the repo ships a GRPO training loop against the same verifier.
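The four failure classes above can each be caught by a few lines of deterministic code. The following is a minimal Python sketch, not the harness's own verifier (which runs on Node): the graph schema, the node types, the parameter-count formula, and every function name here are assumptions made for illustration.

```python
from collections import deque

def in_width(node):
    # Linear layers declare "in"/"out"; other nodes use a single "dim" (assumed schema).
    return node["in"] if node["type"] == "linear" else node["dim"]

def out_width(node):
    return node["out"] if node["type"] == "linear" else node["dim"]

def count_params(node):
    # Simplified counts: weights plus biases; attention = Q, K, V, output projections.
    if node["type"] == "linear":
        return node["in"] * node["out"] + node["out"]
    if node["type"] == "attention":
        d = node["dim"]
        return 4 * (d * d + d)
    return 0

def verify(graph, budget):
    """Return a list of error strings; an empty list means the graph passes."""
    nodes, edges = graph["nodes"], graph["edges"]
    errors = []
    # 1. Attention head count must divide the embedding dimension.
    for name, node in nodes.items():
        if node["type"] == "attention":
            heads = node["heads"]
            if heads <= 0 or node["dim"] % heads != 0:
                errors.append(f"{name}: {heads} heads do not divide dim {node['dim']}")
    # 2. Each edge must connect matching widths.
    successors = {name: [] for name in nodes}
    for src, dst in edges:
        if src not in nodes or dst not in nodes:
            errors.append(f"edge {src}->{dst}: unknown node")
            continue
        successors[src].append(dst)
        if out_width(nodes[src]) != in_width(nodes[dst]):
            errors.append(f"edge {src}->{dst}: width mismatch")
    # 3. Some output must be reachable from an input (breadth-first search).
    queue = deque(n for n, node in nodes.items() if node["type"] == "input")
    seen = set(queue)
    while queue:
        for nxt in successors[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    if not any(nodes[n]["type"] == "output" for n in seen):
        errors.append("no output is reachable from an input")
    # 4. Total parameters must fit the budget.
    total = sum(count_params(node) for node in nodes.values())
    if total > budget:
        errors.append(f"{total} parameters exceed budget {budget}")
    return errors

good = {
    "nodes": {
        "x": {"type": "input", "dim": 64},
        "attn": {"type": "attention", "dim": 64, "heads": 8},
        "proj": {"type": "linear", "in": 64, "out": 10},
        "y": {"type": "output", "dim": 10},
    },
    "edges": [("x", "attn"), ("attn", "proj"), ("proj", "y")],
}
# Same shape, but with one instance of each failure class.
bad = {
    "nodes": {
        "x": {"type": "input", "dim": 64},
        "attn": {"type": "attention", "dim": 64, "heads": 6},
        "proj": {"type": "linear", "in": 32, "out": 10},
        "y": {"type": "output", "dim": 10},
    },
    "edges": [("x", "attn"), ("attn", "proj")],
}
```

Under these assumptions, `verify(good, 20000)` returns an empty list and `verify(bad, 1000)` reports one error per failure class. Because every check is a pure function of the graph, the same verdict can serve as a reward signal during RL training.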

Built by Neurarch. Rows are added only from actual harness runs (the output JSON is checked in as leaderboard-data.json); no self-reported numbers.