LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
November 17, 2025
Authors: Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Roshan Ram, Akshara Prabhakar, Tulika Awalgaonkar, Zixiang Chen, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang
cs.AI
Abstract
As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real-world capabilities becomes critical. While existing benchmarks like LoCoBench [qiu2025locobench] assess long-context code understanding, they focus on single-turn evaluation and cannot capture the multi-turn interactive nature, tool usage patterns, and adaptive reasoning required by real-world coding agents. We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess LLM agents in realistic, long-context software engineering workflows. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across extended development sessions. We also introduce an evaluation methodology with 9 metrics spanning comprehension and efficiency dimensions. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens, enabling precise assessment of long-context performance. Through systematic evaluation of state-of-the-art models, we reveal several key findings: (1) agents exhibit remarkable long-context robustness; (2) a negatively correlated comprehension-efficiency trade-off exists, where thorough exploration increases comprehension but reduces efficiency; and (3) conversation efficiency varies dramatically across models, with strategic tool usage patterns differentiating high-performing agents. As the first long-context LLM agent benchmark for software engineering, LoCoBench-Agent establishes a rigorous foundation for measuring agent capabilities, identifying performance gaps, and advancing autonomous software development at scale.
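
To make the kind of interactive agent environment described above more concrete, the Python sketch below shows a minimal multi-turn evaluation loop in which an agent alternates between calling repository tools (file reads, search) and submitting a final answer, while the harness records turns and tool calls for efficiency scoring. This is an illustrative assumption only: the class and function names (ToolCall, SessionLog, run_scenario), the two example tools, and the loop structure are not the actual LoCoBench-Agent API or its 8-tool suite.

# Hypothetical sketch of a multi-turn agent evaluation loop.
# All names here are illustrative assumptions, not the LoCoBench-Agent API.
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable

@dataclass
class ToolCall:
    name: str     # which tool the agent wants to invoke
    args: dict    # keyword arguments for that tool

@dataclass
class SessionLog:
    turns: int = 0
    tool_calls: list = field(default_factory=list)

def read_file(repo: Path, path: str) -> str:
    """Return the contents of a file inside the scenario repository."""
    return (repo / path).read_text(errors="replace")

def search(repo: Path, query: str) -> list[str]:
    """Naive substring search over Python files (stand-in for a real code-search tool)."""
    hits = []
    for f in repo.rglob("*.py"):
        if query in f.read_text(errors="replace"):
            hits.append(str(f.relative_to(repo)))
    return hits

def run_scenario(agent_step: Callable[[str, SessionLog], "ToolCall | str"],
                 repo: Path, task: str, max_turns: int = 50) -> SessionLog:
    """Drive a multi-turn session: each turn the agent either calls a tool or answers."""
    tools = {
        "read_file": lambda args: read_file(repo, **args),
        "search": lambda args: search(repo, **args),
    }
    log = SessionLog()
    observation = task
    for _ in range(max_turns):
        log.turns += 1
        action = agent_step(observation, log)
        if isinstance(action, str):                     # final answer ends the session
            log.tool_calls.append(("final_answer", action))
            break
        result = tools[action.name](action.args)        # execute the requested tool
        log.tool_calls.append((action.name, action.args))
        observation = str(result)                       # feed tool output back as context
    return log

In a setup like this, efficiency-style metrics could be derived from log.turns and the recorded tool-call sequence, while comprehension would be scored against scenario-specific ground truth; the framework's actual 9 metrics, 8 tools, and 10K-to-1M-token context sweep are defined in the paper itself, not in this sketch.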