CodeArena: A Collective Evaluation Platform for LLM Code Generation
March 3, 2025
Authors: Mingzhe Du, Anh Tuan Luu, Bin Ji, Xiaobao Wu, Dong Huang, Terry Yue Zhuo, Qian Liu, See-Kiong Ng
cs.AI
Abstract
Large Language Models (LLMs) have reshaped code generation by synergizing
their exceptional comprehension of natural language and programming syntax,
thereby substantially boosting developer productivity. These advancements have
prompted numerous efforts to quantitatively evaluate their coding capabilities.
However, persistent challenges, such as benchmark leakage, data dissipation,
and limited system accessibility, continue to impede a timely and accurate
assessment. To address these limitations, we introduce CodeArena, an online
evaluation framework tailored for LLM code generation. The key innovation is a
collective evaluation mechanism, which dynamically recalibrates individual
model scores based on the holistic performance of all participating models,
mitigating score biases caused by widespread benchmark leakage. In addition,
CodeArena ensures open access to all submitted solutions and test cases and
provides automation-friendly APIs to streamline the code evaluation workflow.
Our main contributions are: (1) a collective evaluation system for unbiased
assessment, (2) a public repository of solutions and test cases, and (3)
automation-ready APIs for seamless integration.
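The abstract does not spell out how the collective evaluation mechanism recalibrates scores. The Python sketch below illustrates one plausible reading, under the assumption that each task is down-weighted in proportion to how many participating models already solve it, so that tasks likely affected by benchmark leakage contribute less to a model's final score. The function and weighting scheme are illustrative assumptions, not CodeArena's actual formula or API.

```python
# Hypothetical sketch of a collective evaluation mechanism: individual model
# scores are recalibrated against the aggregate performance of all
# participating models. The weighting scheme is an assumption for
# illustration only, not CodeArena's documented method.

from typing import Dict, List


def collective_scores(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """results maps model name -> per-task pass/fail outcomes (same task order)."""
    models = list(results)
    num_tasks = len(next(iter(results.values())))

    # Fraction of models that solve each task; tasks solved by nearly every
    # model (possibly leaked) receive a lower weight, rarely solved tasks a
    # higher one.
    solve_rates = [
        sum(results[m][t] for m in models) / len(models) for t in range(num_tasks)
    ]
    weights = [1.0 - rate for rate in solve_rates]
    total_weight = sum(weights) or 1.0  # guard against all-solved benchmarks

    # Each model's recalibrated score is the weighted fraction of tasks it passes.
    return {
        m: sum(w for w, passed in zip(weights, results[m]) if passed) / total_weight
        for m in models
    }


if __name__ == "__main__":
    demo = {
        "model_a": [True, True, False, True],
        "model_b": [True, False, False, True],
        "model_c": [True, True, True, False],
    }
    print(collective_scores(demo))
```

In this reading, a model that only solves tasks every other model also solves gains little, while solving tasks most models miss is rewarded; the actual CodeArena recalibration may differ.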