ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
June 10, 2025
Authors: Yuki Imajuku, Kohki Horie, Yoichi Iwata, Kensho Aoki, Naohiro Takahashi, Takuya Akiba
cs.AI
Abstract
How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier LLMs revealed that, while they demonstrate high performance on specific problems, a notable gap remains compared to humans in terms of consistency across problems and long-horizon problem-solving capabilities, highlighting the need for this benchmark to foster future AI advances.
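
To make the iterative, score-driven workflow concrete, below is a minimal Python sketch of the kind of propose-evaluate-refine loop such a framework is meant to support. It is an illustration under stated assumptions only: the names propose_solution, evaluate_on_public_tests, and Attempt are hypothetical placeholders, not part of the ALE-Bench framework's actual API.

# Minimal sketch of a long-horizon, score-driven refinement loop.
# All names below (propose_solution, evaluate_on_public_tests, Attempt)
# are hypothetical placeholders, not the actual ALE-Bench API.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    code: str     # candidate solution program (e.g., C++ or Python source)
    score: float  # score from a local test run; higher is better


def refine_loop(
    propose_solution: Callable[[str, Optional[Attempt]], str],
    evaluate_on_public_tests: Callable[[str], float],
    problem_statement: str,
    budget: int = 20,
) -> Optional[Attempt]:
    """Run a fixed budget of propose-evaluate-refine iterations.

    propose_solution stands in for an LLM call that drafts or revises code
    given the problem statement and the best attempt so far;
    evaluate_on_public_tests stands in for a local scoring run
    (test-run feedback). Both are assumptions for illustration.
    """
    best: Optional[Attempt] = None
    for _ in range(budget):
        code = propose_solution(problem_statement, best)
        score = evaluate_on_public_tests(code)
        attempt = Attempt(code=code, score=score)
        # Keep the highest-scoring candidate, mirroring score-based
        # (rather than pass/fail) judging.
        if best is None or attempt.score > best.score:
            best = attempt
    return best

In a loop of this kind, judging is score-based rather than pass/fail, so the agent retains the highest-scoring candidate and spends its remaining budget trying to improve on it using the feedback from each test run.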