

UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

September 26, 2025
作者: Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin, Wenjie Lu, Guozheng Ma, Haiying He, Yingsha Xie, Qiyang Zhou, Zixuan Hu, Hongze Mi, Yibo Wang, Naiqiang Tan, Hong Chen, Yi R. Fung, Chun Yuan, Li Shen
cs.AI

Abstract

Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are placed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tool management, and interaction with their environments. Under the heaviest scale setting, trajectories average 200k+ tokens and 400+ tool calls, whereas in standard configurations they still exceed 35k tokens and involve more than 60 tool calls on average. Our extensive experiments reveal that LLM agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails on our tasks. To better illustrate these failures, we conduct an in-depth analysis of the collected trajectories, identify eight types of errors, and attribute them to two primary causes: in-context locking and gaps in fundamental functional capabilities. Our code will be available at https://github.com/StarDewXXX/UltraHorizon.
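To make the setting concrete, the sketch below illustrates the kind of long-horizon exploration loop the abstract describes: an agent repeatedly reasons over partial observations, maintains external memory, and issues tool calls to uncover hidden rules. This is a minimal illustration only; `env`, `llm`, `tools`, `Memory`, and the decision fields are assumed, hypothetical interfaces, not the benchmark's actual API.

```python
# Hypothetical sketch of a long-horizon exploration loop (illustrative only;
# none of these names come from the UltraHorizon codebase).
from dataclasses import dataclass, field


@dataclass
class Memory:
    """External notes an agent maintains across a very long trajectory."""
    notes: list[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        self.notes.append(note)

    def recent(self, last_n: int = 20) -> str:
        return "\n".join(self.notes[-last_n:])


def explore(env, llm, tools, max_steps: int = 400) -> list[str]:
    """Iteratively probe a partially observable environment to uncover its hidden rules."""
    memory = Memory()
    observation = env.reset()             # only a partial view of the environment
    hypotheses: list[str] = []

    for step in range(max_steps):         # budget mirrors the 400+ tool-call scale
        # Reason over the latest observation plus accumulated notes, then pick a tool call.
        decision = llm.reason(observation=observation, notes=memory.recent())
        memory.write(f"step {step}: {decision.summary}")

        if decision.new_hypothesis:       # record a candidate hidden rule
            hypotheses.append(decision.new_hypothesis)
        if decision.done:                 # agent believes the rules are uncovered
            break

        # Interact with the environment through the chosen tool and observe the result.
        observation = tools[decision.tool_name](**decision.tool_args)

    return hypotheses
```

In such a loop, the capabilities the benchmark targets map directly onto the code: sustained reasoning and planning (the `reason` step), tool management (the tool dispatch), and memory management (the running notes that must remain useful over hundreds of steps).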