How well can AI agents build a compiler from scratch? Real results from OpenHands SDK and CLI agents with infinite self-loop iteration.
| # | Model | Backend | T0 | T1 | T2 | T3 | T4 | T5 | Mean Reward | Pass Score | Pipeline | 🔄 |
|---|
Mean Reward = Σ(score[i] × weight[i] × bonus[i]) / Σ(weight[i])
weight = [5%, 20%, 20%, 15%, 30%, 10%] | bonus = 1.2 (no resurrection) / 1.0 (resurrected)
Pass Score = Σ(pass[i] × pass_bonus[i]) / (6 × 1.5) × 100
pass_bonus = 1.5 (no resurrection) / 1.0 (resurrected)
Last updated: | Powered by EvoBench