AgentLongBench: A Controllable Benchmark for Long-Context Agents via Environment Rollouts
January 28, 2026
Authors: Shicheng Fang, Yuxin Wang, XiaoRan Liu, Jiahao Lu, Chuanyuan Tan, Xinchi Chen, Yining Zheng, Xuanjing Huang, Xipeng Qiu
cs.AI
Abstract
The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for agentic workflows. Our analysis indicates that this degradation is driven by the minimum number of tokens required to resolve a query. This factor explains why the high information density inherent in massive tool responses poses a significantly greater challenge than the memory fragmentation typical of long multi-turn dialogues.
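For illustration only, the sketch below shows what a rollout-based evaluation loop of this kind might look like: an agent probes a lateral-thinking-puzzle environment turn by turn, the interaction trajectory accumulates in its context, and a final guess is graded. All names (`LateralPuzzleEnv`, `run_rollout`) and the grading logic are hypothetical assumptions, not the paper's actual API or protocol.

```python
# Hypothetical sketch of an environment-rollout evaluation loop for a
# lateral-thinking-puzzle benchmark. Names and logic are illustrative only.
from dataclasses import dataclass, field


@dataclass
class LateralPuzzleEnv:
    """Holds a puzzle surface (what the agent sees) and its hidden solution."""
    surface: str
    solution: str
    facts: list[str]                      # atomic hidden facts the agent must uncover
    history: list[dict] = field(default_factory=list)

    def step(self, question: str) -> str:
        """Answer a yes/no probe; a real system would likely use an LLM judge here."""
        answer = "yes" if any(f.lower() in question.lower() for f in self.facts) else "no"
        self.history.append({"question": question, "answer": answer})
        return answer


def run_rollout(env: LateralPuzzleEnv, agent, max_turns: int = 50) -> dict:
    """Let `agent` interact with the environment, then grade its final guess.

    `agent` is any callable mapping the accumulated context to either a
    probing question or a final answer prefixed with 'ANSWER:'.
    """
    context = env.surface
    for turn in range(max_turns):
        action = agent(context)
        if action.startswith("ANSWER:"):
            guess = action.removeprefix("ANSWER:").strip()
            # Naive exact-match grading stub; a real benchmark would use a stricter judge.
            solved = guess.lower() == env.solution.lower()
            return {"solved": solved, "turns": turn + 1, "trajectory": env.history}
        feedback = env.step(action)
        # The monotonically growing context is where long-context pressure comes from.
        context += f"\nQ: {action}\nA: {feedback}"
    return {"solved": False, "turns": max_turns, "trajectory": env.history}
```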