ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
June 10, 2025
Authors: Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, Takuya Akiba
cs.AI
Abstract
How well do AI systems perform in algorithm engineering for hard optimization
problems in domains such as package-delivery routing, crew scheduling, factory
production planning, and power-grid balancing? We introduce ALE-Bench, a new
benchmark for evaluating AI systems on score-based algorithmic programming
contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench
presents optimization problems that are computationally hard and admit no known
exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench
encourages iterative solution refinement over long time horizons. Our software
framework supports interactive agent architectures that leverage test-run
feedback and visualizations. Our evaluation of frontier LLMs revealed that
while they demonstrate high performance on specific problems, a notable gap
remains compared to humans in terms of consistency across problems and
long-horizon problem-solving capabilities. This highlights the need for this
benchmark to foster future AI advancements.
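
To make the interaction pattern described above concrete, the following is a minimal sketch of a score-driven, long-horizon refinement loop of the kind the benchmark encourages. It is illustrative only and does not use the actual ALE-Bench API: run_tests, propose_revision, and refine are hypothetical stand-ins for the test-run harness and the revising agent.

# Illustrative sketch only: run_tests, propose_revision, and refine are
# hypothetical stand-ins, not the actual ALE-Bench interfaces.
import random


def run_tests(solution: str, seeds: list[int]) -> float:
    """Hypothetical test-run harness: returns a mean local score for the
    candidate solution over a set of local test seeds."""
    random.seed(hash(solution) % (2**32))
    return sum(random.uniform(0, 100) for _ in seeds) / len(seeds)


def propose_revision(solution: str, feedback: float) -> str:
    """Hypothetical reviser (e.g. an LLM call) that rewrites the current
    solution in light of the latest score feedback."""
    return solution + f"\n# revision after local score {feedback:.1f}"


def refine(initial_solution: str, seeds: list[int], budget: int = 10) -> tuple[str, float]:
    """Score-driven refinement loop: keep the best-scoring candidate seen so
    far, since score-based contests reward the final score rather than a
    single pass/fail verdict."""
    best_solution = initial_solution
    best_score = run_tests(initial_solution, seeds)
    current = initial_solution
    for _ in range(budget):
        current = propose_revision(current, best_score)
        score = run_tests(current, seeds)
        if score > best_score:  # only keep improvements
            best_solution, best_score = current, score
    return best_solution, best_score


if __name__ == "__main__":
    solution, score = refine("# baseline greedy solution", seeds=list(range(8)))
    print(f"best local score: {score:.1f}")

In this sketch the agent repeatedly submits a candidate, observes aggregate test-run feedback, and revises, which mirrors the iterative, long-horizon workflow the benchmark is designed to evaluate; a real agent would replace the stand-ins with actual test execution and model-driven code edits.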