LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

Abstract

As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real-world capabilities becomes critical. Existing benchmarks such as LoCoBench [qiu2025locobench] assess long-context code understanding, but they focus on single-turn evaluation and cannot capture the multi-turn interactive nature, tool usage patterns, and adaptive reasoning required of real-world coding agents. We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess LLM agents in realistic, long-context software engineering workflows. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across extended development sessions. We also introduce an evaluation methodology with 9 metrics across comprehension and efficiency dimensions. The framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens, enabling precise assessment of long-context performance. Through systematic evaluation of state-of-the-art models, we reveal several key findings: (1) agents exhibit remarkable long-context robustness; (2) a comprehension-efficiency trade-off exists, with the two dimensions negatively correlated: thorough exploration increases comprehension but reduces efficiency; and (3) conversation efficiency varies dramatically across models, with strategic tool usage patterns differentiating high-performing agents. As the first long-context LLM agent benchmark for software engineering, LoCoBench-Agent establishes a rigorous foundation for measuring agent capabilities, identifying performance gaps, and advancing autonomous software development at scale.
