

LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth

February 8, 2026
Authors: Weihao Zeng, Yuzhen Huang, Junxian He
cs.AI

Abstract

Large language models (LLMs) are increasingly capable of carrying out long-running, real-world tasks. However, as the amount of context grows, their reliability often deteriorates, a phenomenon known as "context rot". Existing long-context benchmarks primarily focus on single-step settings that evaluate a model's ability to retrieve information from a long snippet. In realistic scenarios, however, LLMs often need to act as agents that explore environments, follow instructions and plans, extract useful information, and predict correct actions under a dynamically growing context. To assess language agents in such settings, we introduce LOCA-bench (a benchmark for LOng-Context Agents). Given a task prompt, LOCA-bench leverages automated and scalable control of environment states to regulate the agent's context length. This design enables LOCA-bench to extend the context length potentially to infinity in a controlled way while keeping the underlying task semantics fixed. LOCA-bench evaluates language agents as a combination of models and scaffolds, including various context management strategies. While agent performance generally degrades as the environment states grow more complex, advanced context management techniques can substantially improve the overall success rate. We open-source LOCA-bench to provide a platform for evaluating models and scaffolds in long-context, agentic scenarios: https://github.com/hkust-nlp/LOCA-bench
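The core idea described above is that the observable environment can be scaled up while the underlying task semantics stay fixed. The following is a minimal toy sketch of that idea, not LOCA-bench's actual implementation: the names `build_env`, `num_states`, and the key/value task are all illustrative assumptions.

```python
import random

def build_env(num_states: int, seed: int = 0):
    """Toy illustration of controllable context growth: add distractor
    states to grow the agent's observation while the task (report the
    value stored under 'target') is unchanged at any scale."""
    rng = random.Random(seed)
    target_value = "blue"  # fixed answer regardless of environment size
    # Distractor states inflate the context without touching the task.
    states = {f"state_{i}": rng.choice(["red", "green", "yellow"])
              for i in range(num_states)}
    states["target"] = target_value
    # The agent's context is the serialized environment, so its length
    # grows roughly linearly with num_states.
    observation = "\n".join(f"{k}: {v}" for k, v in states.items())
    return observation, target_value

small_obs, answer = build_env(10)
large_obs, _ = build_env(10_000)
assert len(large_obs) > len(small_obs)  # context grows in a controlled way
assert answer == "blue"                 # task semantics stay fixed
```

Under this framing, an agent's success rate can be measured as a function of `num_states` alone, isolating the effect of context length from task difficulty.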