LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

Abstract

Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that is continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.
