AetherCode: Evaluating LLMs' Ability to Win in Premier Programming Competitions
August 22, 2025
Authors: Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, Chenyang Ran, Guang Tian, Shoutai Zhu, Yeheng Duan, Zhenghui Kang, Zhenxing Lin, Shangshu Li, Qiang Luo, Qingshen Long, Zhiyong Chen, Yihan Xiao, Yurong Wu, Daoguang Zan, Yuyi Fu, Mingxuan Wang, Ming Ding
cs.AI
Abstract
Competitive programming has emerged as a critical benchmark for evaluating
the reasoning and coding capabilities of Large Language Models (LLMs). Despite
impressive progress on existing benchmarks, we argue that current evaluations
overstate model proficiency, masking a substantial gap between LLMs and elite
human programmers. This gap arises from two key limitations: insufficient
difficulty and scope of benchmark problems, and evaluation bias from
low-quality test cases. To address these shortcomings, we present AetherCode, a
new benchmark that draws problems from premier programming competitions such as
IOI and ICPC, offering broader coverage and higher difficulty. AetherCode
further incorporates comprehensive, expert-validated test suites built through
a hybrid of automated generation and human curation, ensuring rigorous and
reliable assessment. By combining challenging problem design with robust
evaluation, AetherCode provides a more faithful measure of LLM capabilities and
sets a new standard for future research in code reasoning.
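To make concrete what "rigorous and reliable assessment" against a test suite typically involves, the following is a minimal judging sketch. It is an illustrative assumption rather than AetherCode's actual harness: it presumes a directory of paired *.in/*.out files and exact-match output comparison, whereas real IOI/ICPC-style evaluation often adds special checkers, per-test or subtask scoring, and sandboxed execution. All names (judge, solution_cmd, tests_dir) are hypothetical.

```python
# Minimal judging sketch (illustrative assumption, not AetherCode's actual harness).
# Runs a compiled solution on every *.in file and diffs trimmed stdout against *.out.
import subprocess
from pathlib import Path


def judge(solution_cmd: list[str], tests_dir: str, time_limit: float = 2.0) -> str:
    """Return a verdict string for `solution_cmd` over all test cases in `tests_dir`."""
    for in_file in sorted(Path(tests_dir).glob("*.in")):
        expected = in_file.with_suffix(".out").read_text().rstrip()
        with in_file.open() as stdin:
            try:
                result = subprocess.run(
                    solution_cmd, stdin=stdin, capture_output=True,
                    text=True, timeout=time_limit,
                )
            except subprocess.TimeoutExpired:
                return f"Time Limit Exceeded on {in_file.name}"
        if result.returncode != 0:
            return f"Runtime Error on {in_file.name}"
        if result.stdout.rstrip() != expected:
            return f"Wrong Answer on {in_file.name}"
    return "Accepted"


# Example with hypothetical paths: judge(["./solution"], "tests/problem_a")
```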