ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
June 10, 2025
Authors: Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, Takuya Akiba
cs.AI
Abstract
How well do AI systems perform in algorithm engineering for hard optimization
problems in domains such as package-delivery routing, crew scheduling, factory
production planning, and power-grid balancing? We introduce ALE-Bench, a new
benchmark for evaluating AI systems on score-based algorithmic programming
contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench
presents optimization problems that are computationally hard and admit no known
exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench
encourages iterative solution refinement over long time horizons. Our software
framework supports interactive agent architectures that leverage test-run
feedback and visualizations. Our evaluation of frontier LLMs revealed that
while they demonstrate high performance on specific problems, a notable gap
remains compared to humans in terms of consistency across problems and
long-horizon problem-solving capabilities. This highlights the need for this
benchmark to foster future AI advancements.
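
To make the interaction pattern described above concrete, the following is a minimal sketch of a score-driven, long-horizon refinement loop of the kind the benchmark encourages. It is illustrative only and does not use the actual ALE-Bench API: run_tests, propose_revision, and refine are hypothetical stand-ins for the test-run harness and the revising agent.

# Illustrative sketch only: run_tests, propose_revision, and refine are
# hypothetical stand-ins, not the actual ALE-Bench interfaces.
import random


def run_tests(solution: str, seeds: list[int]) -> float:
    """Hypothetical test-run harness: returns a mean local score for the
    candidate solution over a set of local test seeds."""
    random.seed(hash(solution) % (2**32))
    return sum(random.uniform(0, 100) for _ in seeds) / len(seeds)


def propose_revision(solution: str, feedback: float) -> str:
    """Hypothetical reviser (e.g. an LLM call) that rewrites the current
    solution in light of the latest score feedback."""
    return solution + f"\n# revision after local score {feedback:.1f}"


def refine(initial_solution: str, seeds: list[int], budget: int = 10) -> tuple[str, float]:
    """Score-driven refinement loop: keep the best-scoring candidate seen so
    far, since score-based contests reward the final score rather than a
    single pass/fail verdict."""
    best_solution = initial_solution
    best_score = run_tests(initial_solution, seeds)
    current = initial_solution
    for _ in range(budget):
        current = propose_revision(current, best_score)
        score = run_tests(current, seeds)
        if score > best_score:  # only keep improvements
            best_solution, best_score = current, score
    return best_solution, best_score


if __name__ == "__main__":
    solution, score = refine("# baseline greedy solution", seeds=list(range(8)))
    print(f"best local score: {score:.1f}")

In this sketch the agent repeatedly submits a candidate, observes aggregate test-run feedback, and revises, which mirrors the iterative, long-horizon workflow the benchmark is designed to evaluate; a real agent would replace the stand-ins with actual test execution and model-driven code edits.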