BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
October 9, 2025
Authors: Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra
cs.AI
Abstract
Crowdsourced model evaluation platforms, such as Chatbot Arena, enable
real-time evaluation from human perspectives to assess the quality of model
responses. In the coding domain, manually examining the quality of
LLM-generated content is extremely challenging, as it requires understanding
long chunks of raw code and mentally simulating code execution. To this
end, we introduce BigCodeArena, an open human evaluation platform for code
generation backed by a comprehensive and on-the-fly execution environment.
Built on top of Chatbot Arena, BigCodeArena enables the execution of
LLM-generated code and allows humans to interact with the execution process and
outcomes. We collected over 14,000 raw code-centric conversation sessions
across 10 widely used LLMs, spanning 10 programming languages and 8 types of execution
environments. Among these conversations, we identified more than 4,700
multi-turn samples with pairwise human preferences. Further analysis uncovers
underexplored preferences of LLMs in fine-grained domains characterized by
tasks, languages, and frameworks. To systematically examine code understanding
and generation capabilities of frontier LLMs, we curated two benchmarks based
on the collected data, namely BigCodeReward and AutoCodeArena. For
BigCodeReward, we post-processed the 4,700 conversations and evaluated the
consistency between reward models and human preferences. The evaluation shows
that most LLMs judge coding preferences more reliably when execution results
are available. Inspired by these findings, we propose
AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding
quality of LLMs without human involvement. We find that proprietary LLMs such as
GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead recently released models
in code generation performance.
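
For readers unfamiliar with Elo-style aggregation of pairwise preferences, the sketch below illustrates how per-model ratings could be derived from battle outcomes of the kind BigCodeArena and AutoCodeArena collect. It is a minimal illustration under assumed defaults (K-factor, initial rating, function and model names are hypothetical), not the paper's actual implementation.

```python
from collections import defaultdict

def elo_ratings(battles, k=32.0, initial=1000.0):
    """battles: iterable of (model_a, model_b, winner), winner in {'a', 'b', 'tie'}."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in battles:
        # Expected score of model_a under the standard Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Symmetric update: what model_a gains, model_b loses.
        ratings[model_a] += k * (score_a - expected_a)
        ratings[model_b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

# Toy usage with hypothetical pairwise judgments between two placeholder models.
print(elo_ratings([
    ("model-x", "model-y", "a"),
    ("model-x", "model-y", "tie"),
    ("model-y", "model-x", "b"),
]))
```

In practice, arena-style leaderboards often fit a Bradley-Terry model over all battles rather than applying sequential Elo updates, since the latter depends on battle order; the sketch above is only meant to convey the pairwise-preference-to-rating idea.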