

AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

August 22, 2025
Authors: Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, Chenyang Ran, Guang Tian, Shoutai Zhu, Yeheng Duan, Zhenghui Kang, Zhenxing Lin, Shangshu Li, Qiang Luo, Qingshen Long, Zhiyong Chen, Yihan Xiao, Yurong Wu, Daoguang Zan, Yuyi Fu, Mingxuan Wang, Ming Ding
cs.AI

Abstract

Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.
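To make the described methodology concrete, the following is a minimal, hypothetical sketch of the kind of hybrid test-suite construction the abstract outlines: automatically generated stress-test inputs are merged with human-curated edge cases, every input is labelled with the output of a trusted reference solution, and a candidate program is accepted only if it matches on the full suite. All names here (gen_random_case, reference_solve, build_suite, judge) are illustrative assumptions, not part of AetherCode itself.

```python
# Hypothetical sketch of a hybrid test-suite pipeline: generated inputs plus
# curated edge cases, labelled by a trusted reference solution, then used to
# judge a candidate program. Names and the toy problem are illustrative only.
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TestCase:
    input_data: str
    expected_output: str
    source: str  # "generated" or "curated"


def reference_solve(input_data: str) -> str:
    """Trusted reference solution; toy problem: sum of integers on one line."""
    return str(sum(int(x) for x in input_data.split()))


def gen_random_case(rng: random.Random, size: int) -> str:
    """Random stress-test input with `size` integers."""
    return " ".join(str(rng.randint(-10**9, 10**9)) for _ in range(size))


def build_suite(curated_inputs: List[str], n_generated: int, seed: int = 0) -> List[TestCase]:
    """Combine curated edge cases with generated stress tests, labelling
    every input with the reference solution's output."""
    rng = random.Random(seed)
    inputs = [(inp, "curated") for inp in curated_inputs]
    inputs += [(gen_random_case(rng, rng.randint(1, 1000)), "generated")
               for _ in range(n_generated)]
    return [TestCase(inp, reference_solve(inp), src) for inp, src in inputs]


def judge(candidate: Callable[[str], str], suite: List[TestCase]) -> bool:
    """Accept the candidate only if it matches the expected output on every case."""
    return all(candidate(tc.input_data) == tc.expected_output for tc in suite)


if __name__ == "__main__":
    suite = build_suite(curated_inputs=["0", "-1 1", "1000000000 1000000000"],
                        n_generated=50)
    # A deliberately buggy candidate that ignores negative numbers: the curated
    # edge case "-1 1" exposes it even though many random cases might not.
    buggy = lambda s: str(sum(max(int(x), 0) for x in s.split()))
    print("reference passes:", judge(reference_solve, suite))  # True
    print("buggy passes:    ", judge(buggy, suite))            # False
```

The point of the sketch is the division of labour the abstract emphasizes: automated generation provides breadth, while expert-curated cases target the adversarial corners where low-quality suites would otherwise let incorrect solutions pass.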