LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

Abstract

As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real-world capabilities becomes critical. Existing benchmarks such as LoCoBench (Qiu et al., 2025) assess long-context code understanding, but they focus on single-turn evaluation and cannot capture the multi-turn interaction, tool usage patterns, and adaptive reasoning required of real-world coding agents. We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess LLM agents in realistic, long-context software engineering workflows. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across extended development sessions. We also introduce an evaluation methodology with 9 metrics spanning the comprehension and efficiency dimensions. The framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them at context lengths ranging from 10K to 1M tokens, enabling precise assessment of long-context performance. Through systematic evaluation of state-of-the-art models, we report several key findings: (1) agents exhibit remarkable long-context robustness; (2) comprehension and efficiency are negatively correlated, a trade-off in which thorough exploration increases comprehension but reduces efficiency; and (3) conversation efficiency varies dramatically across models, with strategic tool usage patterns distinguishing high-performing agents. As the first long-context LLM agent benchmark for software engineering, LoCoBench-Agent establishes a rigorous foundation for measuring agent capabilities, identifying performance gaps, and advancing autonomous software development at scale.
