
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering

September 11, 2025
Authors: Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang
cs.AI

Abstract

The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined in a LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention. LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.
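The abstract describes 17 metrics grouped into 4 dimensions and combined into a single LoCoBench Score (LCBS), but does not give the aggregation formula. Below is a minimal, hypothetical sketch of what such an aggregation could look like: the dimension names, metric groupings, and equal-weight averaging are illustrative assumptions, not the benchmark's actual definition (see the released repository for the real implementation).

```python
# Hypothetical sketch of rolling per-metric scores up into a single
# LCBS-style score. Dimension names, metric names, and the equal-weight
# aggregation are assumptions for illustration only.
from statistics import mean

# Assumed grouping: each dimension averages its own normalized metrics
# (all metric scores assumed to lie in [0, 1]).
DIMENSIONS = {
    "software_engineering": ["architecture_understanding", "cross_file_consistency"],
    "functional_correctness": ["test_pass_rate", "bug_fix_accuracy"],
    "code_quality": ["style_conformance", "security_findings"],
    "long_context_robustness": ["degradation_slope", "retrieval_accuracy"],
}

def locobench_score(metric_scores: dict[str, float]) -> float:
    """Average each dimension's metrics, then average the dimension
    means into one score in [0, 1]."""
    dim_means = [
        mean(metric_scores[m] for m in metrics)
        for metrics in DIMENSIONS.values()
    ]
    return mean(dim_means)

if __name__ == "__main__":
    # Toy scores for a single model across the assumed metrics.
    scores = {m: 0.5 for metrics in DIMENSIONS.values() for m in metrics}
    print(f"LCBS (toy): {locobench_score(scores):.3f}")
```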