ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering

Abstract

How well do AI systems perform in algorithm engineering for hard optimization problems in domains such as package-delivery routing, crew scheduling, factory production planning, and power-grid balancing? We introduce ALE-Bench, a new benchmark for evaluating AI systems on score-based algorithmic programming contests. Drawing on real tasks from the AtCoder Heuristic Contests, ALE-Bench presents optimization problems that are computationally hard and admit no known exact solution. Unlike short-duration, pass/fail coding benchmarks, ALE-Bench encourages iterative solution refinement over long time horizons. Our software framework supports interactive agent architectures that leverage test-run feedback and visualizations. Our evaluation of frontier large language models (LLMs) shows that, while they perform well on specific problems, a notable gap remains relative to humans in consistency across problems and in long-horizon problem-solving ability. This gap highlights the need for this benchmark to foster future AI advancements.
