LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

Abstract

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined in a LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention. LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.
