BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
October 9, 2025
Authors: Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra
cs.AI
Abstract
Crowdsourced model evaluation platforms, such as Chatbot Arena, enable
real-time evaluation from human perspectives to assess the quality of model
responses. In the coding domain, manually examining the quality of
LLM-generated content is extremely challenging, as it requires understanding
long chunks of raw code and mentally simulating code execution. To this
end, we introduce BigCodeArena, an open human evaluation platform for code
generation backed by a comprehensive and on-the-fly execution environment.
Built on top of Chatbot Arena, BigCodeArena enables the execution of
LLM-generated code and allows humans to interact with the execution process and
outcomes. We collected over 14,000 raw code-centric conversation sessions
across 10 widely used LLMs, spanning 10 languages and 8 types of execution
environments. Among these conversations, we identified more than 4,700
multi-turn samples with pairwise human preferences. Further analysis uncovers
underexplored preferences of LLMs in fine-grained domains characterized by
tasks, languages, and frameworks. To systematically examine code understanding
and generation capabilities of frontier LLMs, we curated two benchmarks based
on the collected data, namely BigCodeReward and AutoCodeArena. For
BigCodeReward, we post-processed the 4,700 conversations and evaluated the
consistency between reward models and human preferences. The evaluation shows
that most LLMs judge coding preferences more reliably when
execution results are available. Inspired by these findings, we propose
AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding
quality of LLMs without human involvement. Among recently released models, we find
that proprietary LLMs such as GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead
in code generation performance.
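As a concrete illustration of how an Arena-style leaderboard turns pairwise preferences into a ranking, the sketch below applies online Elo updates to hypothetical battle records. The record format, K-factor, initial rating, and the `compute_elo` helper are assumptions made here for illustration; the paper's actual rating procedure may differ (for example, a Bradley-Terry fit with bootstrapping, as used in Chatbot Arena).

```python
# Illustrative sketch only: online Elo updates from pairwise preference records.
# Battle format, K-factor, and initial rating are assumptions, not the paper's
# actual AutoCodeArena implementation.
from collections import defaultdict


def compute_elo(battles, k=32.0, init=1000.0):
    """battles: iterable of (model_a, model_b, winner), where winner is
    'model_a', 'model_b', or 'tie'. Returns a dict of Elo ratings."""
    ratings = defaultdict(lambda: init)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the standard Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        # Symmetric updates: what model_a gains, model_b loses.
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)


# Toy usage with made-up battle records.
battles = [
    ("gpt-5", "claude-sonnet-4", "model_a"),
    ("claude-opus-4", "gpt-5", "tie"),
    ("claude-sonnet-4", "claude-opus-4", "model_b"),
]
for model, rating in sorted(compute_elo(battles).items(), key=lambda x: -x[1]):
    print(f"{model}: {rating:.1f}")
```

In the human-in-the-loop setting described above, the winner label comes from a crowdsourced vote after interacting with the executed code; in the automatic AutoCodeArena setting, it would instead come from automated judging, with no human involvement.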