LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth
February 8, 2026
Authors: Weihao Zeng, Yuzhen Huang, Junxian He
cs.AI
Abstract
Large language models (LLMs) are increasingly capable of carrying out long-running, real-world tasks. However, as the amount of context grows, their reliability often deteriorates, a phenomenon known as "context rot". Existing long-context benchmarks primarily focus on single-step settings that evaluate a model's ability to retrieve information from a long text passage. In realistic scenarios, however, LLMs often need to act as agents that explore environments, follow instructions and plans, extract useful information, and predict correct actions under a dynamically growing context. To assess language agents in such settings, we introduce LOCA-bench (a benchmark for LOng-Context Agents). Given a task prompt, LOCA-bench leverages automated and scalable control of environment states to regulate the agent's context length. This design enables LOCA-bench to scale the context length arbitrarily, in a controlled way, while keeping the underlying task semantics fixed. LOCA-bench evaluates language agents as combinations of models and scaffolds, covering a range of context management strategies. We find that while agent performance generally degrades as the environment states grow more complex, advanced context management techniques can substantially improve the overall success rate. We open-source LOCA-bench to provide a platform for evaluating models and scaffolds in long-context, agentic scenarios: https://github.com/hkust-nlp/LOCA-bench
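To make the core mechanism concrete (scaling environment states to stretch the agent's context while the underlying task stays fixed), here is a minimal, hypothetical Python sketch. The environment, the class and function names, and the naive full-history scaffold below are illustrative assumptions, not the LOCA-bench API or codebase.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyLongContextEnv:
    """Hypothetical toy environment: the task ("report which room holds the key")
    never changes, but the number of rooms is a knob that controls how much
    observation text the agent must carry in its context."""
    num_rooms: int          # scaling knob: more rooms -> longer interaction, longer context
    seed: int = 0
    _key_room: int = field(init=False)

    def __post_init__(self):
        rng = random.Random(self.seed)
        self._key_room = rng.randrange(self.num_rooms)

    def observe(self, room: int) -> str:
        # Each observation is appended to the agent's context by the scaffold.
        if room == self._key_room:
            return f"Room {room}: you see a key."
        return f"Room {room}: the room is empty."

    def check(self, answer: int) -> bool:
        return answer == self._key_room


def run_naive_agent(env: ToyLongContextEnv) -> tuple[bool, int]:
    """A naive scaffold that keeps the full interaction history in context."""
    context: list[str] = ["Task: report the room number that contains the key."]
    answer = -1
    for room in range(env.num_rooms):
        obs = env.observe(room)
        context.append(obs)          # context grows linearly with num_rooms
        if "key" in obs:
            answer = room
    context_chars = sum(len(line) for line in context)
    return env.check(answer), context_chars


if __name__ == "__main__":
    # Increasing num_rooms stretches the context arbitrarily,
    # while the task semantics (locate the key) stay fixed.
    for n in (10, 100, 1000):
        success, ctx_len = run_naive_agent(ToyLongContextEnv(num_rooms=n, seed=42))
        print(f"rooms={n:5d}  success={success}  context_chars={ctx_len}")
```

In the paper's framing, the interesting comparison would then be between scaffolds: the same environment could be paired with, e.g., summarization- or retrieval-based context management instead of the full-history strategy sketched above.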