🚀 The First Serial Agent Benchmark

EvoBench

Can AI agents evolve? Test them on a real compiler construction pipeline, from lexer to code generation, with automatic resurrection when they fail.

6 Serial Tasks
19 Models Tested
73 Test Cases
∞ Self-Loop
🔧 Setup → 📝 Lexer → 🌳 Parser → ⚙️ IR Gen → 🔥 IR Opt → 🏗️ Asm Gen
🔄 Auto-Resurrection on Failure

Why EvoBench?

Existing benchmarks test isolated tasks. EvoBench tests evolution.

🔗 Strong Serial Pipeline

Tasks are causally dependent. Task 2 needs Task 1's output. This tests real knowledge accumulation.
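As a toy illustration of that chaining (the stage functions below are stand-ins, not EvoBench's actual code), note how each stage can only be as good as the artifact handed to it:

# Hypothetical sketch: each stage consumes only what the previous stage produced,
# so a broken lexer leaves the parser and IR generator nothing correct to build on.
def lexer(src):      return src.split()                        # stand-in "token stream"
def parser(tokens):  return ("expr", tokens)                   # stand-in "AST"
def ir_gen(ast):     return [f"push {tok}" for tok in ast[1]]  # stand-in "IR"

artifact = "1 + 2"
for stage in (lexer, parser, ir_gen):
    artifact = stage(artifact)        # Task N+1 starts from Task N's output
print(artifact)                       # ['push 1', 'push +', 'push 2']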

🔄 Auto-Resurrection

When an agent fails at Task N, the golden answers for Task N are injected so Task N+1 can still be evaluated.
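A minimal sketch of that mechanism, assuming a golden solution is stored per task (the paths, names, and copy strategy here are illustrative assumptions, not the harness's real API):

import shutil
from pathlib import Path

def run_task(task_id: int, workspace: Path) -> bool:
    """Placeholder: let the agent attempt the task, then run its test suite."""
    return False  # stand-in result

# Hypothetical resurrection loop: when the agent fails Task N, the golden answer
# for Task N is copied into the workspace so Task N+1 can still be evaluated.
def run_with_resurrection(task_ids, workspace: Path, golden_dir: Path):
    for task_id in task_ids:
        if not run_task(task_id, workspace):
            golden = golden_dir / f"task_{task_id}"   # assumed directory layout
            shutil.copytree(golden, workspace, dirs_exist_ok=True)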

🤖 Multi-Backend

Supports OpenHands SDK, Claude Code, Codex CLI, Kimi CLI, Gemini CLI, and raw OpenAI API.

♾️ Infinite Self-Loop

Agents iterate autonomously until task completion. No artificial turn limits. Real autonomous coding.
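Conceptually the loop is nothing more than the sketch below; agent_step and tests_pass are placeholders for whatever the chosen backend does on each turn:

# Hypothetical sketch of the self-loop: no fixed turn budget, the agent keeps
# reading, editing, compiling, and testing until the task's checks pass.
def self_loop(agent_step, tests_pass, workspace):
    turns = 0
    while not tests_pass(workspace):
        agent_step(workspace)   # one autonomous action: edit code, run a build, inspect a failure
        turns += 1
    return turns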

📊 Multi-Dimensional Metrics

Mean Reward with resurrection bonus, Pass Score, pipeline score, and node pass rates.
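The exact formulas are defined by the harness; purely as an illustration, per-node pass rates and an overall pass score could be aggregated like this (the field names below are assumptions):

# Illustrative aggregation only; not EvoBench's actual metric definitions.
def summarize(results):
    """results: {task_name: {"passed": int, "total": int}} per pipeline node."""
    node_pass_rates = {task: r["passed"] / r["total"] for task, r in results.items()}
    passed = sum(r["passed"] for r in results.values())
    total = sum(r["total"] for r in results.values())
    return {"node_pass_rates": node_pass_rates, "pass_score": passed / total}

print(summarize({"lexer": {"passed": 10, "total": 12}, "parser": {"passed": 8, "total": 15}}))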

🎓 Real-World Difficulty

Based on a full-semester compiler course. Students take months. Can AI do it in hours?

How It Works

01. Choose Your Agent

Select from the OpenHands SDK, Claude Code, Codex CLI, Kimi CLI, Gemini CLI, or any OpenAI-compatible API.

02. Run the Pipeline

evo run --backend openhands --model your-model

03. Watch Evolution

The agent reads the docs, writes code, compiles, tests, and debugs, iterating autonomously.

04. Get Results

Detailed JSON/CSV/Markdown reports with multi-dimensional metrics.
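If you want to post-process a run, the JSON report loads like any other file; the file name here is an assumption, not a documented path:

import json
from pathlib import Path

# Illustrative only: inspect which metrics a report exposes.
report = json.loads(Path("results.json").read_text())
print(sorted(report.keys()))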

Ready to Test Your Agent?

pip install evobench
evo run --backend openhands --model your-model --tasks 0-5
View on GitHub