CodeArena: A Collective Evaluation Platform for LLM Code Generation
March 3, 2025
Authors: Mingzhe Du, Anh Tuan Luu, Bin Ji, Xiaobao Wu, Dong Huang, Terry Yue Zhuo, Qian Liu, See-Kiong Ng
cs.AI
Abstract
Large Language Models (LLMs) have reshaped code generation by synergizing
their exceptional comprehension of natural language and programming syntax,
thereby substantially boosting developer productivity. These advancements have
prompted numerous efforts to quantitatively evaluate their coding capabilities.
However, persistent challenges, such as benchmark leakage, data dissipation,
and limited system accessibility, continue to impede a timely and accurate
assessment. To address these limitations, we introduce CodeArena, an online
evaluation framework tailored for LLM code generation. The key innovation is a
collective evaluation mechanism, which dynamically recalibrates individual
model scores based on the holistic performance of all participating models,
mitigating score biases caused by widespread benchmark leakage. In addition,
CodeArena ensures open access to all submitted solutions and test cases and
provides automation-friendly APIs to streamline the code evaluation workflow.
Our main contributions are: (1) a collective evaluation system for unbiased
assessment, (2) a public repository of solutions and test cases, and (3)
automation-ready APIs for seamless integration.
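The abstract does not spell out how the collective evaluation mechanism recalibrates scores. The Python sketch below illustrates one plausible reading, under the assumption that each task is down-weighted in proportion to how many participating models already solve it, so that tasks likely affected by benchmark leakage contribute less to a model's final score. The function and weighting scheme are illustrative assumptions, not CodeArena's actual formula or API.

```python
# Hypothetical sketch of a collective evaluation mechanism: individual model
# scores are recalibrated against the aggregate performance of all
# participating models. The weighting scheme is an assumption for
# illustration only, not CodeArena's documented method.

from typing import Dict, List


def collective_scores(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """results maps model name -> per-task pass/fail outcomes (same task order)."""
    models = list(results)
    num_tasks = len(next(iter(results.values())))

    # Fraction of models that solve each task; tasks solved by nearly every
    # model (possibly leaked) receive a lower weight, rarely solved tasks a
    # higher one.
    solve_rates = [
        sum(results[m][t] for m in models) / len(models) for t in range(num_tasks)
    ]
    weights = [1.0 - rate for rate in solve_rates]
    total_weight = sum(weights) or 1.0  # guard against all-solved benchmarks

    # Each model's recalibrated score is the weighted fraction of tasks it passes.
    return {
        m: sum(w for w, passed in zip(weights, results[m]) if passed) / total_weight
        for m in models
    }


if __name__ == "__main__":
    demo = {
        "model_a": [True, True, False, True],
        "model_b": [True, False, False, True],
        "model_c": [True, True, True, False],
    }
    print(collective_scores(demo))
```

In this reading, a model that only solves tasks every other model also solves gains little, while solving tasks most models miss is rewarded; the actual CodeArena recalibration may differ.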