CodeArena: A Collective Evaluation Platform for LLM Code Generation

Abstract

Large Language Models (LLMs) have reshaped code generation by combining an exceptional comprehension of natural language with that of programming syntax, thereby substantially boosting developer productivity. These advances have prompted numerous efforts to quantitatively evaluate their coding capabilities. However, persistent challenges, such as benchmark leakage, data dissipation, and limited system accessibility, continue to impede timely and accurate assessment. To address these limitations, we introduce CodeArena, an online evaluation framework tailored for LLM code generation. The key innovation is a collective evaluation mechanism, which dynamically recalibrates each model's score based on the overall performance of all participating models, mitigating score biases caused by widespread benchmark leakage. In addition, CodeArena ensures open access to all submitted solutions and test cases and provides automation-friendly APIs to streamline the code evaluation workflow. Our main contributions are: (1) a collective evaluation system for unbiased assessment, (2) a public repository of solutions and test cases, and (3) automation-ready APIs for seamless integration.
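
To make the idea of collective evaluation concrete, the following is a minimal sketch and not CodeArena's actual scoring rule, which the abstract does not specify. It assumes a hypothetical weighting in which each problem is worth the fraction of participating models that failed it, so a problem solved by every model (for example, because it leaked into training data) contributes nothing. The function name `collective_scores` and the data layout are illustrative assumptions.

```python
from typing import Dict, Set


def collective_scores(
    solved: Dict[str, Set[str]], problems: Set[str]
) -> Dict[str, float]:
    """Score each model by summing per-problem weights.

    ASSUMPTION (not from the paper): a problem's weight is the fraction of
    participating models that failed it. Scores therefore depend on the
    whole pool of models and change whenever a new model is added.
    """
    n_models = len(solved)
    if n_models == 0:
        return {}

    # Weight of each problem = 1 - (share of models that solved it).
    weights: Dict[str, float] = {}
    for problem in problems:
        n_solved = sum(1 for s in solved.values() if problem in s)
        weights[problem] = 1.0 - n_solved / n_models

    # A model's score is the total weight of the problems it solved.
    return {
        model: sum(weights[p] for p in solved_set if p in problems)
        for model, solved_set in solved.items()
    }


if __name__ == "__main__":
    # Hypothetical results: every model solves p1; p2 and p3 are each
    # solved by a single model.
    results = {
        "model_a": {"p1", "p2"},
        "model_b": {"p1"},
        "model_c": {"p1", "p3"},
    }
    print(collective_scores(results, {"p1", "p2", "p3"}))
```

In this toy run, `p1` carries zero weight because all three models solve it, so `model_b` scores 0 while `model_a` and `model_c` each score 2/3. Adding a fourth model changes every weight, which is the sense in which scores are recalibrated collectively under this assumed rule.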
