Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Abstract

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, continually adding fresh problems to the set, and filtering them by release date, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB automatically tracks future LCB updates. This enables systematic assessment of cross-language code-generation competence and requires models to sustain performance well beyond Python. We evaluate 24 instruction-tuned and reasoning LLMs on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.
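
The release-date filtering behind the contamination-aware evaluation can be illustrated with a minimal sketch. It assumes that each problem carries a release date and that a model's training cutoff is known. The `Problem` class, its field names, and `filter_after_cutoff` are hypothetical names chosen for illustration, not the actual LCB or Multi-LCB schema or API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; the real LCB schema may use different fields.
    problem_id: str
    release_date: date


def filter_after_cutoff(problems, cutoff):
    """Keep only problems released strictly after the given cutoff date."""
    return [p for p in problems if p.release_date > cutoff]


# Illustrative data only; these are not real benchmark entries.
problems = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2024, 2, 15)),
    Problem("p3", date(2024, 9, 30)),
]

# Assumed training cutoff for some model under evaluation.
fresh = filter_after_cutoff(problems, date(2024, 1, 1))
print([p.problem_id for p in fresh])  # ['p2', 'p3']
```

Under these assumptions, a model is scored only on problems it could not have seen during training. The same filter would apply unchanged to every target language, since the translated tasks inherit the release date of their Python source problem.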