BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Abstract

Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time assessment of model response quality from a human perspective. In the coding domain, manually examining the quality of LLM-generated content is extremely challenging, because it requires understanding long chunks of raw code and deliberately simulating its execution. To this end, we introduce BigCodeArena, an open human evaluation platform for code generation backed by a comprehensive, on-the-fly execution environment. Built on top of Chatbot Arena, BigCodeArena executes LLM-generated code and lets humans interact with the execution process and its outcomes. We collected over 14,000 raw code-centric conversation sessions across 10 widely used LLMs, spanning 10 languages and 8 types of execution environments. Among these conversations, we identified more than 4,700 multi-turn samples with pairwise human preferences. Further analysis uncovers underexplored preferences of LLMs in fine-grained domains characterized by tasks, languages, and frameworks. To systematically examine the code understanding and generation capabilities of frontier LLMs, we curated two benchmarks based on the collected data: BigCodeReward and AutoCodeArena. For BigCodeReward, we post-processed the 4,700 conversations and evaluated the consistency between reward models and human preferences. The evaluation shows that most LLMs perform better at judging coding preferences when execution results are available. Inspired by these findings, we propose AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding quality of LLMs without human involvement. We find that proprietary LLMs such as GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead in code generation performance among recently emerging models.
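
To make the idea of an Elo rating derived from pairwise preferences concrete, the following is a minimal sketch of the textbook online Elo update. The abstract does not describe how AutoCodeArena actually computes its ratings, so this should not be read as the authors' procedure. The K-factor of 32, the initial rating of 1000, the model names, and the battle records are all hypothetical values chosen for illustration.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(battles, k: float = 32.0, initial: float = 1000.0) -> dict:
    """Compute Elo ratings from (model_a, model_b, outcome) records.

    outcome is "a" (A preferred), "b" (B preferred), or "tie".
    k and initial are illustrative defaults, not values from the paper.
    """
    ratings = defaultdict(lambda: initial)
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, outcome in battles:
        actual_a = score_for_a[outcome]
        expected_a = expected_score(ratings[model_a], ratings[model_b])
        # The update is zero-sum: whatever A gains, B loses.
        delta = k * (actual_a - expected_a)
        ratings[model_a] += delta
        ratings[model_b] -= delta
    return dict(ratings)


# Hypothetical pairwise preference records (placeholder model names).
battles = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-z", "tie"),
    ("model-x", "model-z", "a"),
]

ratings = elo_ratings(battles)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Online Elo depends on the order in which battles are processed. Arena-style leaderboards often reduce that sensitivity by shuffling and averaging, or by fitting a Bradley-Terry model instead. Which variant AutoCodeArena uses is not stated in this section.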
