BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution
October 9, 2025
Authors: Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra
cs.AI
Abstract
Crowdsourced model evaluation platforms, such as Chatbot Arena, enable
real-time evaluation from human perspectives to assess the quality of model
responses. In the coding domain, manually examining the quality of
LLM-generated content is extremely challenging, as it requires understanding
long chunks of raw code and mentally simulating code execution. To this
end, we introduce BigCodeArena, an open human evaluation platform for code
generation backed by a comprehensive and on-the-fly execution environment.
Built on top of Chatbot Arena, BigCodeArena enables the execution of
LLM-generated code and allows humans to interact with the execution process and
outcomes. We collected over 14,000 raw code-centric conversation sessions
across 10 widely used LLMs, spanning 10 languages and 8 types of execution
environments. Among these conversations, we identified more than 4,700
multi-turn samples with pairwise human preferences. Further analysis uncovers
underexplored preferences of LLMs in fine-grained domains characterized by
tasks, languages, and frameworks. To systematically examine code understanding
and generation capabilities of frontier LLMs, we curated two benchmarks based
on the collected data, namely BigCodeReward and AutoCodeArena. For
BigCodeReward, we post-processed the 4,700 conversations and evaluated the
consistency between reward models and human preferences. The evaluation shows
that most LLMs judge coding preferences more reliably when
execution results are available. Inspired by these findings, we propose
AutoCodeArena, an automatic Elo rating benchmark designed to assess the coding
quality of LLMs without human involvement. Among recently released models, we find
that proprietary LLMs such as GPT-5, Claude-Sonnet-4, and Claude-Opus-4 still lead
in code generation performance.
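As a concrete illustration of how an Arena-style leaderboard turns pairwise preferences into a ranking, the sketch below applies online Elo updates to hypothetical battle records. The record format, K-factor, initial rating, and the `compute_elo` helper are assumptions made here for illustration; the paper's actual rating procedure may differ (for example, a Bradley-Terry fit with bootstrapping, as used in Chatbot Arena).

```python
# Illustrative sketch only: online Elo updates from pairwise preference records.
# Battle format, K-factor, and initial rating are assumptions, not the paper's
# actual AutoCodeArena implementation.
from collections import defaultdict


def compute_elo(battles, k=32.0, init=1000.0):
    """battles: iterable of (model_a, model_b, winner), where winner is
    'model_a', 'model_b', or 'tie'. Returns a dict of Elo ratings."""
    ratings = defaultdict(lambda: init)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the standard Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        # Symmetric updates: what model_a gains, model_b loses.
        ratings[model_a] = ra + k * (score_a - expected_a)
        ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)


# Toy usage with made-up battle records.
battles = [
    ("gpt-5", "claude-sonnet-4", "model_a"),
    ("claude-opus-4", "gpt-5", "tie"),
    ("claude-sonnet-4", "claude-opus-4", "model_b"),
]
for model, rating in sorted(compute_elo(battles).items(), key=lambda x: -x[1]):
    print(f"{model}: {rating:.1f}")
```

In the human-in-the-loop setting described above, the winner label comes from a crowdsourced vote after interacting with the executed code; in the automatic AutoCodeArena setting, it would instead come from automated judging, with no human involvement.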