BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
October 9, 2025
Authors: Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra
cs.AI
Abstract
Crowdsourced model evaluation platforms, such as Chatbot Arena, enable
real-time evaluation from human perspectives to assess the quality of model
responses. In the coding domain, manually examining the quality of
LLM-generated content is extremely challenging, as it requires understanding
long chunks of raw code and mentally simulating code execution. To this
end, we introduce BigCodeArena, an open human evaluation platform for code
generation backed by a comprehensive and on-the-fly execution environment.
Built on top of Chatbot Arena, BigCodeArena enables the execution of
LLM-generated code and allows humans to interact with the execution process and
outcomes. We collected over 14,000 raw code-centric conversation sessions
across 10 widely used LLMs, spanning 10 programming languages and 8 types of execution
environments. Among these conversations, we identified more than 4,700
multi-turn samples with pairwise human preferences. Further analysis uncovers
underexplored preferences of LLMs in fine-grained domains characterized by
tasks, languages, and frameworks. To systematically examine code understanding
and generation capabilities of frontier LLMs, we curated two benchmarks based
on the collected data, namely BigCodeReward and AutoCodeArena. For
BigCodeReward, we post-processed the 4,700 conversations and evaluated the
consistency between reward models and human preferences. The evaluation shows
that most LLMs judge coding preferences more reliably when execution results
are available. Inspired by these findings, we propose
AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding
quality of LLMs without human involvement. We find that proprietary LLMs such as
GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead recently released models
in code generation performance.
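
For readers unfamiliar with Elo-style aggregation of pairwise preferences, the sketch below illustrates how per-model ratings could be derived from battle outcomes of the kind BigCodeArena and AutoCodeArena collect. It is a minimal illustration under assumed defaults (K-factor, initial rating, function and model names are hypothetical), not the paper's actual implementation.

```python
from collections import defaultdict

def elo_ratings(battles, k=32.0, initial=1000.0):
    """battles: iterable of (model_a, model_b, winner), winner in {'a', 'b', 'tie'}."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in battles:
        # Expected score of model_a under the standard Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Symmetric update: what model_a gains, model_b loses.
        ratings[model_a] += k * (score_a - expected_a)
        ratings[model_b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Toy usage with hypothetical pairwise judgments between two placeholder models.
print(elo_ratings([
    ("model-x", "model-y", "a"),
    ("model-x", "model-y", "tie"),
    ("model-y", "model-x", "b"),
]))
```

In practice, arena-style leaderboards often fit a Bradley-Terry model over all battles rather than applying sequential Elo updates, since the latter depends on battle order; the sketch above is only meant to convey the pairwise-preference-to-rating idea.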