🚀 The First Serial Agent Benchmark

EvoBench

Can AI agents evolve? Test them on a real compiler construction pipeline, from lexer to code generation, with automatic resurrection when they fail.

6 Serial Tasks
19 Models Tested
73 Test Cases
∞ Self-Loop
🔧 Setup → 📝 Lexer → 🌳 Parser → ⚙️ IR Gen → 🔥 IR Opt → 🏗️ Asm Gen
🔄 Auto-Resurrection on Failure

Why EvoBench?

Existing benchmarks test isolated tasks. EvoBench tests evolution.

🔗 Strong Serial Pipeline

Tasks are causally dependent. Task 2 needs Task 1's output. This tests real knowledge accumulation.
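As a toy illustration of that chaining (the stage functions below are stand-ins, not EvoBench's actual code), note how each stage can only be as good as the artifact handed to it:

# Hypothetical sketch: each stage consumes only what the previous stage produced,
# so a broken lexer leaves the parser and IR generator nothing correct to build on.
def lexer(src):      return src.split()                        # stand-in "token stream"
def parser(tokens):  return ("expr", tokens)                   # stand-in "AST"
def ir_gen(ast):     return [f"push {tok}" for tok in ast[1]]  # stand-in "IR"

artifact = "1 + 2"
for stage in (lexer, parser, ir_gen):
    artifact = stage(artifact)        # Task N+1 starts from Task N's output
print(artifact)                       # ['push 1', 'push +', 'push 2']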

🔄 Auto-Resurrection

When an agent fails at Task N, the golden answers for Task N are injected so Task N+1 can still be evaluated.
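A minimal sketch of that mechanism, assuming a golden solution is stored per task (the paths, names, and copy strategy here are illustrative assumptions, not the harness's real API):

import shutil
from pathlib import Path

def run_task(task_id: int, workspace: Path) -> bool:
    """Placeholder: let the agent attempt the task, then run its test suite."""
    return False  # stand-in result

# Hypothetical resurrection loop: when the agent fails Task N, the golden answer
# for Task N is copied into the workspace so Task N+1 can still be evaluated.
def run_with_resurrection(task_ids, workspace: Path, golden_dir: Path):
    for task_id in task_ids:
        if not run_task(task_id, workspace):
            golden = golden_dir / f"task_{task_id}"   # assumed directory layout
            shutil.copytree(golden, workspace, dirs_exist_ok=True)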

🤖 Multi-Backend

Supports OpenHands SDK, Claude Code, Codex CLI, Kimi CLI, Gemini CLI, and raw OpenAI API.

♾️ Infinite Self-Loop

Agents iterate autonomously until task completion. No artificial turn limits. Real autonomous coding.
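Conceptually the loop is nothing more than the sketch below; agent_step and tests_pass are placeholders for whatever the chosen backend does on each turn:

# Hypothetical sketch of the self-loop: no fixed turn budget, the agent keeps
# reading, editing, compiling, and testing until the task's checks pass.
def self_loop(agent_step, tests_pass, workspace):
    turns = 0
    while not tests_pass(workspace):
        agent_step(workspace)   # one autonomous action: edit code, run a build, inspect a failure
        turns += 1
    return turns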

📊 Multi-Dimensional Metrics

Mean Reward with resurrection bonus, Pass Score, pipeline score, and node pass rates.
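The exact formulas are defined by the harness; purely as an illustration, per-node pass rates and an overall pass score could be aggregated like this (the field names below are assumptions):

# Illustrative aggregation only; not EvoBench's actual metric definitions.
def summarize(results):
    """results: {task_name: {"passed": int, "total": int}} per pipeline node."""
    node_pass_rates = {task: r["passed"] / r["total"] for task, r in results.items()}
    passed = sum(r["passed"] for r in results.values())
    total = sum(r["total"] for r in results.values())
    return {"node_pass_rates": node_pass_rates, "pass_score": passed / total}

print(summarize({"lexer": {"passed": 10, "total": 12}, "parser": {"passed": 8, "total": 15}}))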

🎓 Real-World Difficulty

Based on a full-semester compiler course. Students take months. Can AI do it in hours?

How It Works

01. Choose Your Agent

Select from the OpenHands SDK, Claude Code, Codex CLI, Kimi CLI, Gemini CLI, or any OpenAI-compatible API.

02. Run the Pipeline

evo run --backend openhands --model your-model

03. Watch Evolution

The agent reads the docs, writes code, compiles, tests, and debugs, iterating autonomously.

04. Get Results

Detailed JSON/CSV/Markdown reports with multi-dimensional metrics.
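If you want to post-process a run, the JSON report loads like any other file; the file name here is an assumption, not a documented path:

import json
from pathlib import Path

# Illustrative only: inspect which metrics a report exposes.
report = json.loads(Path("results.json").read_text())
print(sorted(report.keys()))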

Ready to Test Your Agent?

pip install evobench
evo run --backend openhands --model your-model --tasks 0-5
View on GitHub