Can AI agents evolve? Test them on a real compiler construction pipeline, from lexer to code generation, with automatic resurrection when they fail.
Existing benchmarks test isolated tasks. EvoBench tests evolution.
Tasks are causally dependent. Task 2 needs Task 1's output. This tests real knowledge accumulation.
When an agent fails at Task N, golden answers are injected so we can still evaluate Task N+1.
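A minimal sketch of how causal chaining with golden-answer resurrection could work. All names here (`TaskResult`, `run_pipeline`, `run_agent`, `golden_outputs`) are hypothetical placeholders, not EvoBench's actual API:

```python
# Hypothetical sketch of causal task chaining with golden-answer injection.
# Names are illustrative; they are not EvoBench's real interface.
from dataclasses import dataclass

@dataclass
class TaskResult:
    passed: bool
    output: str  # artifact consumed by the next task (e.g., lexer source)

def run_pipeline(tasks, run_agent, golden_outputs):
    """Run tasks in dependency order; when Task N fails, hand Task N+1
    the known-good golden artifact so it can still be evaluated."""
    upstream = None
    results = []
    for i, task in enumerate(tasks):
        result = run_agent(task, upstream_artifact=upstream)
        results.append(result)
        # Resurrection: a failed task does not doom its successors;
        # the next task builds on the golden answer instead.
        upstream = result.output if result.passed else golden_outputs[i]
    return results
```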
Supports OpenHands SDK, Claude Code, Codex CLI, Kimi CLI, Gemini CLI, and raw OpenAI API.
Agents iterate autonomously until task completion. No artificial turn limits. Real autonomous coding.
Mean Reward with resurrection bonus, Pass Score, pipeline score, and node pass rates.
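As a rough illustration only, a mean reward with a resurrection bonus might be computed as below. The weighting and the bonus rule are assumptions for the sketch, not EvoBench's published formulas:

```python
# Illustrative metric computation; the weights and the resurrection-bonus
# rule are assumed here, not taken from EvoBench's documentation.
def mean_reward(results, bonus_per_recovery=0.1):
    """Average per-task reward, plus a small bonus each time the agent
    passes a task immediately after being resurrected on the prior one."""
    rewards = [1.0 if r.passed else 0.0 for r in results]
    recoveries = sum(
        1 for prev, cur in zip(results, results[1:])
        if not prev.passed and cur.passed
    )
    return sum(rewards) / len(rewards) + bonus_per_recovery * recoveries

def pass_score(results):
    """Fraction of tasks passed outright, without resurrection."""
    return sum(r.passed for r in results) / len(results)
```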
Based on a full-semester compiler course. Students take months. Can AI do it in hours?
Select from OpenHands SDK, Claude Code, Codex, Gemini CLI, or any OpenAI-compatible API.
evo run --backend openhands --model your-model
The agent reads docs, writes code, compiles, tests, and debugs, iterating autonomously.
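Conceptually, that inner loop has the shape sketched below. Every name (`agent.read`, `agent.write_code`, `run_tests`, and so on) is a placeholder, not a real backend interface:

```python
# Conceptual agent loop: unbounded iteration until the task's tests pass.
# All names are placeholders, not any SDK's actual methods.
def solve_task(agent, task, run_tests):
    agent.read(task.docs)
    while True:
        patch = agent.write_code(task)    # propose or revise a solution
        report = run_tests(patch)         # compile and run the task's tests
        if report.all_passed:
            return patch                  # done: no artificial turn cap
        agent.observe(report.failures)    # feed errors back for debugging
```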
Detailed JSON/CSV/Markdown reports with multi-dimensional metrics.
pip install evobench
evo run --backend openhands --model your-model --tasks 0-5