AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

Abstract

Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.
