AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

Abstract

Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.
