AetherCode: Evaluating LLMs' Ability to Win in Premier Programming Competitions
August 22, 2025
Authors: Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, Chenyang Ran, Guang Tian, Shoutai Zhu, Yeheng Duan, Zhenghui Kang, Zhenxing Lin, Shangshu Li, Qiang Luo, Qingshen Long, Zhiyong Chen, Yihan Xiao, Yurong Wu, Daoguang Zan, Yuyi Fu, Mingxuan Wang, Ming Ding
cs.AI
Abstract
Competitive programming has emerged as a critical benchmark for evaluating
the reasoning and coding capabilities of Large Language Models (LLMs). Despite
impressive progress on existing benchmarks, we argue that current evaluations
overstate model proficiency, masking a substantial gap between LLMs and elite
human programmers. This gap arises from two key limitations: insufficient
difficulty and scope of benchmark problems, and evaluation bias from
low-quality test cases. To address these shortcomings, we present AetherCode, a
new benchmark that draws problems from premier programming competitions such as
IOI and ICPC, offering broader coverage and higher difficulty. AetherCode
further incorporates comprehensive, expert-validated test suites built through
a hybrid of automated generation and human curation, ensuring rigorous and
reliable assessment. By combining challenging problem design with robust
evaluation, AetherCode provides a more faithful measure of LLM capabilities and
sets a new standard for future research in code reasoning.
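To make concrete what "rigorous and reliable assessment" against a test suite typically involves, the following is a minimal judging sketch. It is an illustrative assumption rather than AetherCode's actual harness: it presumes a directory of paired *.in/*.out files and exact-match output comparison, whereas real IOI/ICPC-style evaluation often adds special checkers, per-test or subtask scoring, and sandboxed execution. All names (judge, solution_cmd, tests_dir) are hypothetical.

```python
# Minimal judging sketch (illustrative assumption, not AetherCode's actual harness).
# Runs a compiled solution on every *.in file and diffs trimmed stdout against *.out.
import subprocess
from pathlib import Path


def judge(solution_cmd: list[str], tests_dir: str, time_limit: float = 2.0) -> str:
    """Return a verdict string for `solution_cmd` over all test cases in `tests_dir`."""
    for in_file in sorted(Path(tests_dir).glob("*.in")):
        expected = in_file.with_suffix(".out").read_text().rstrip()
        with in_file.open() as stdin:
            try:
                result = subprocess.run(
                    solution_cmd, stdin=stdin, capture_output=True,
                    text=True, timeout=time_limit,
                )
            except subprocess.TimeoutExpired:
                return f"Time Limit Exceeded on {in_file.name}"
        if result.returncode != 0:
            return f"Runtime Error on {in_file.name}"
        if result.stdout.rstrip() != expected:
            return f"Wrong Answer on {in_file.name}"
    return "Accepted"


# Example with hypothetical paths: judge(["./solution"], "tests/problem_a")
```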