Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Abstract

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, continually adding fresh problems to the set, and filtering them by release date, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code-generation competence and requiring models to sustain their performance beyond Python. We evaluate 24 instruction-tuned and reasoning LLMs on Multi-LCB and find evidence of Python overfitting, language-specific contamination, and substantial disparities in performance across languages. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.
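
The release-date filtering and per-language comparison described above can be illustrated with a minimal Python sketch. It is not the LCB or Multi-LCB implementation: the `Problem` record, its field names, and the helpers `filter_after_cutoff` and `pass_rate_by_language` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; the real LCB schema may differ.
    problem_id: str
    release_date: date


def filter_after_cutoff(problems, cutoff):
    """Keep only problems released strictly after a model's training cutoff."""
    return [p for p in problems if p.release_date > cutoff]


def pass_rate_by_language(results):
    """Map (language, passed) pairs to the fraction of passed attempts per language."""
    totals = {}
    for language, passed in results:
        solved, seen = totals.get(language, (0, 0))
        totals[language] = (solved + int(passed), seen + 1)
    return {lang: solved / seen for lang, (solved, seen) in totals.items()}
```

Under these assumptions, evaluating a model only on problems released after its training cutoff limits the chance that it saw them during training, and comparing the per-language pass rates on that same filtered set is one simple way to surface a gap between Python and the other languages.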