AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
August 22, 2025
Authors: Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, Chenyang Ran, Guang Tian, Shoutai Zhu, Yeheng Duan, Zhenghui Kang, Zhenxing Lin, Shangshu Li, Qiang Luo, Qingshen Long, Zhiyong Chen, Yihan Xiao, Yurong Wu, Daoguang Zan, Yuyi Fu, Mingxuan Wang, Ming Ding
cs.AI
Abstract
Competitive programming has emerged as a critical benchmark for evaluating
the reasoning and coding capabilities of Large Language Models (LLMs). Despite
impressive progress on existing benchmarks, we argue that current evaluations
overstate model proficiency, masking a substantial gap between LLMs and elite
human programmers. This gap arises from two key limitations: insufficient
difficulty and scope of benchmark problems, and evaluation bias from
low-quality test cases. To address these shortcomings, we present AetherCode, a
new benchmark that draws problems from premier programming competitions such as
IOI and ICPC, offering broader coverage and higher difficulty. AetherCode
further incorporates comprehensive, expert-validated test suites built through
a hybrid of automated generation and human curation, ensuring rigorous and
reliable assessment. By combining challenging problem design with robust
evaluation, AetherCode provides a more faithful measure of LLM capabilities and
sets a new standard for future research in code reasoning.
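
To make the notion of "rigorous and reliable assessment" against expert-validated test suites concrete, below is a minimal sketch of how a single submission might be judged against a directory of test cases. This is an illustrative assumption, not the AetherCode harness: all paths, limits, file extensions, and function names are hypothetical, and it only covers exact-output problems (interactive or special-judge tasks, common at IOI and ICPC, would additionally need a custom checker).

```python
"""Minimal judging sketch (illustrative only, not the AetherCode harness).

Compiles one C++ submission and runs it over every test case in a directory,
comparing trimmed stdout against reference answers. Paths, the time limit,
and file naming (*.in / *.ans) are assumptions for this example.
"""
import subprocess
from pathlib import Path

TIME_LIMIT_S = 2.0  # assumed per-test time limit


def compile_solution(src: Path, binary: Path) -> bool:
    """Compile the candidate C++ solution; return True on success."""
    result = subprocess.run(
        ["g++", "-O2", "-std=c++17", str(src), "-o", str(binary)],
        capture_output=True,
    )
    return result.returncode == 0


def run_case(binary: Path, input_file: Path, answer_file: Path) -> bool:
    """Run one test case and compare trimmed output with the reference answer."""
    try:
        proc = subprocess.run(
            [str(binary)],
            stdin=input_file.open("rb"),
            capture_output=True,
            timeout=TIME_LIMIT_S,
        )
    except subprocess.TimeoutExpired:
        return False  # time limit exceeded
    if proc.returncode != 0:
        return False  # runtime error
    return proc.stdout.decode().strip() == answer_file.read_text().strip()


def judge(src: Path, tests_dir: Path) -> str:
    """Return a simple verdict: accepted only if every test case passes."""
    binary = Path("./solution_bin")
    if not compile_solution(src, binary):
        return "Compilation Error"
    for input_file in sorted(tests_dir.glob("*.in")):
        if not run_case(binary, input_file, input_file.with_suffix(".ans")):
            return f"Failed on {input_file.name}"
    return "Accepted"


if __name__ == "__main__":
    print(judge(Path("solution.cpp"), Path("tests/")))
```

The all-or-nothing verdict above mirrors why test-suite quality matters: with weak or incomplete test cases, an incorrect program can still return "Accepted", which is exactly the evaluation bias the abstract argues inflates current benchmark scores.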