Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Abstract

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, continually adding fresh problems to the set, and filtering them by release date, LCB provides contamination-aware evaluation and a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in the other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code-generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 instruction-tuned and reasoning LLMs on Multi-LCB and found evidence of Python overfitting, language-specific contamination, and substantial disparities in performance across programming languages. Our results establish Multi-LCB as a rigorous new benchmark for code evaluation across multiple programming languages, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.
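
The release-date filtering behind contamination-aware evaluation can be illustrated with a minimal sketch. It assumes each problem carries a release date and that a model has a known training cutoff; the `Problem` record, its field names, and the `contamination_free` helper are hypothetical and are not taken from the LCB or Multi-LCB code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; field names are illustrative, not LCB's schema.
    problem_id: str
    release_date: date


def contamination_free(problems, training_cutoff):
    """Keep only problems released strictly after the training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


# Toy data, invented for this example.
problems = [
    Problem("a", date(2023, 5, 1)),
    Problem("b", date(2024, 2, 15)),
    Problem("c", date(2024, 9, 30)),
]

# A model with an assumed cutoff of 2024-01-01 is scored only on "b" and "c".
kept = contamination_free(problems, date(2024, 1, 1))
```

Under these assumptions, the same filtered subset could be reused for every target language, since the translated tasks would inherit the release date of the Python task they derive from.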