LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

Abstract

Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that is continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem by algorithmic category and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.
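
The abstract reports pass@1 figures without stating how the metric is computed. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator (the fraction of size-k sample subsets containing at least one accepted solution), averaged over problems. The function names `pass_at_k` and `benchmark_pass_at_k` and the `(n, c)` input layout are assumptions made for this example, not details taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of submissions sampled for the problem (assumed input)
    c: number of those submissions that were accepted
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k submissions failed, every size-k subset
    # must contain at least one accepted submission.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn submissions failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single submission per problem (n = 1, k = 1), this reduces to the fraction of problems solved on the first attempt, which is the usual reading of a headline figure such as "53% pass@1".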
