Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Abstract

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, continuously adding new problems to the set, and filtering them by release date, LCB provides contamination-aware evaluation and a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in the other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB automatically tracks future LCB updates, enabling systematic assessment of cross-language code-generation competence and requiring models to sustain performance well beyond Python. We evaluate 24 instruction and reasoning LLMs on Multi-LCB and find evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.