LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
September 11, 2025
Authors: Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang
cs.AI
Abstract
The emergence of long-context language models with context windows extending
to millions of tokens has created new opportunities for sophisticated code
understanding and software development evaluation. We propose LoCoBench, a
comprehensive benchmark specifically designed to evaluate long-context LLMs in
realistic, complex software development scenarios. Unlike existing code
evaluation benchmarks that focus on single-function completion or short-context
tasks, LoCoBench addresses the critical evaluation gap for long-context
capabilities that require understanding entire codebases, reasoning across
multiple files, and maintaining architectural consistency across large-scale
software systems. Our benchmark provides 8,000 evaluation scenarios
systematically generated across 10 programming languages, with context lengths
spanning 10K to 1M tokens, a 100x variation that enables precise assessment of
long-context performance degradation in realistic software development
settings. LoCoBench introduces 8 task categories that capture essential
long-context capabilities: architectural understanding, cross-file refactoring,
multi-session development, bug investigation, feature implementation, code
comprehension, integration testing, and security analysis. Through a 5-phase
pipeline, we create diverse, high-quality scenarios that challenge LLMs to
reason about complex codebases at unprecedented scale. We introduce a
comprehensive evaluation framework with 17 metrics across 4 dimensions,
including 8 new evaluation metrics, combined into a LoCoBench Score (LCBS). Our
evaluation of state-of-the-art long-context models reveals substantial
performance gaps, demonstrating that long-context understanding in complex
software development represents a significant unsolved challenge that demands
more attention. LoCoBench is released at:
https://github.com/SalesforceAIResearch/LoCoBench.
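
To make the composite-score idea concrete, below is a minimal sketch of how dimension-level metrics might be aggregated into a single score in the spirit of the LoCoBench Score (LCBS). The metric names, dimension groupings, example values, and equal weighting are illustrative assumptions, not the paper's actual definitions.

```python
# Hypothetical aggregation of per-dimension metrics into one composite score.
# All names, groupings, values, and the equal-weight scheme are assumptions
# made for illustration; LoCoBench's real 17 metrics and weights may differ.
from statistics import mean

# Example per-metric scores on a 0-1 scale for one model (made-up values),
# grouped into four illustrative dimensions.
metric_scores = {
    "software_engineering": {"architectural_consistency": 0.62, "cross_file_reasoning": 0.55},
    "functional_correctness": {"tests_passed": 0.71, "feature_completeness": 0.66},
    "code_quality": {"style_adherence": 0.80, "security_findings": 0.58},
    "long_context_utilization": {"retrieval_accuracy": 0.49, "context_degradation": 0.44},
}

def composite_score(scores: dict[str, dict[str, float]]) -> float:
    """Average metrics within each dimension, then average dimensions equally."""
    dimension_means = [mean(metrics.values()) for metrics in scores.values()]
    return mean(dimension_means)

print(f"Composite score: {composite_score(metric_scores):.3f}")
```

A two-level average like this keeps a dimension with many metrics from dominating the final score; the benchmark itself may use different weights per dimension or per metric.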