ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

Abstract

How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier large language models (LLMs) shows that, while they perform well on specific problems, a notable gap remains relative to humans in consistency across problems and in long-horizon problem-solving capability. This highlights the need for this benchmark to foster future AI advancements.
