CodeArena: A Collective Evaluation Platform for LLM Code Generation

Abstract

Large Language Models (LLMs) have reshaped code generation by combining a strong grasp of natural language with a strong grasp of programming syntax, substantially boosting developer productivity. These advances have prompted numerous efforts to evaluate their coding capabilities quantitatively. However, persistent challenges such as benchmark leakage, data dissipation, and limited system accessibility continue to impede timely and accurate assessment. To address these limitations, we introduce CodeArena, an online evaluation framework tailored for LLM code generation. Its key innovation is a collective evaluation mechanism that dynamically recalibrates each model's score based on the overall performance of all participating models, mitigating score biases caused by widespread benchmark leakage. In addition, CodeArena ensures open access to all submitted solutions and test cases and provides automation-friendly APIs to streamline the code evaluation workflow. Our main contributions are: (1) a collective evaluation system for unbiased assessment, (2) a public repository of solutions and test cases, and (3) automation-ready APIs for seamless integration.
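
The abstract does not give the recalibration formula, so the following is only a minimal sketch of how a collective score could behave, not CodeArena's actual method. It assumes a hypothetical scheme in which each problem is weighted by the fraction of participating models that failed it, so a problem that every model solves (for instance because it leaked into training data) contributes nothing. The function name `collective_scores` and the input layout are illustrative assumptions.

```python
from typing import Dict, Set


def collective_scores(
    solved: Dict[str, Set[str]], problems: Set[str]
) -> Dict[str, float]:
    """Hypothetical collective scoring, not the scheme from the paper.

    solved maps a model name to the set of problem ids it solved.
    Each problem is weighted by the share of models that failed it,
    so the weights change whenever the pool of models changes.
    """
    n_models = len(solved)
    if n_models == 0:
        return {}

    # Weight of a problem = fraction of participating models that failed it.
    weights: Dict[str, float] = {}
    for problem in problems:
        solvers = sum(1 for done in solved.values() if problem in done)
        weights[problem] = 1.0 - solvers / n_models

    # A model's score is the sum of the weights of the problems it solved.
    return {
        model: sum(weights[p] for p in done if p in weights)
        for model, done in solved.items()
    }


if __name__ == "__main__":
    results = {"model_a": {"p1", "p2"}, "model_b": {"p1"}, "model_c": {"p1"}}
    print(collective_scores(results, {"p1", "p2"}))
```

Under this assumed scheme, adding or removing a model changes every problem's weight, which is the sense in which individual scores depend on the performance of the whole pool.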

