Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Abstract

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, continually adding fresh problems, and filtering them by release date, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code-generation competence and requiring models to sustain their performance well beyond Python. We evaluate 24 instruction-tuned and reasoning LLMs on Multi-LCB and find evidence of Python overfitting, language-specific contamination, and substantial disparities in performance across programming languages. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.
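
The release-date filtering mentioned in the abstract can be illustrated with a minimal sketch. It assumes each problem carries a release date and that a model is scored only on problems released after its training cutoff. The `Problem` class, the `filter_after_cutoff` function, and the sample data are hypothetical names chosen for illustration and are not taken from LCB or Multi-LCB.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; the real benchmark schema may differ.
    problem_id: str
    release_date: date


def filter_after_cutoff(problems, cutoff):
    """Keep only problems released strictly after the given cutoff date."""
    return [p for p in problems if p.release_date > cutoff]


# Illustrative data only, not real benchmark entries.
problems = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2024, 2, 15)),
    Problem("p3", date(2024, 9, 30)),
]

# Assumed training cutoff for some model under evaluation.
held_out = filter_after_cutoff(problems, date(2024, 1, 1))
print([p.problem_id for p in held_out])
```

Under this assumption, the same date filter would apply unchanged to every target language, since the translated tasks inherit the release date of the Python task they were derived from.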