

UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

September 26, 2025
Authors: Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin, Wenjie Lu, Guozheng Ma, Haiying He, Yingsha Xie, Qiyang Zhou, Zixuan Hu, Hongze Mi, Yibo Wang, Naiqiang Tan, Hong Chen, Yi R. Fung, Chun Yuan, Li Shen
cs.AI

Abstract

Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are placed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tool management, and interaction with their environments. Under the heaviest scale setting, trajectories average more than 200k tokens and 400 tool calls, while even standard configurations average more than 35k tokens and 60 tool calls. Our extensive experiments reveal that LLM agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails on our tasks. To better illustrate how agents fail, we conduct an in-depth analysis of the collected trajectories, identifying eight error types and attributing them to two primary causes: in-context locking and gaps in fundamental functional capabilities. Our code will be available at https://github.com/StarDewXXX/UltraHorizon.
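
To make the setup concrete, below is a minimal, hypothetical sketch of the kind of exploration loop the abstract describes: an agent repeatedly probes an environment through tool calls, stores observations as memory, and tries to infer a hidden rule within a tool-call budget. Every name here (HiddenRuleEnv, ExplorerAgent, the secret-modulus rule, run_episode) is an illustrative assumption for exposition, not the benchmark's actual environments or API.

# A minimal, illustrative exploration loop (assumed interfaces, not the
# benchmark's actual API): the agent probes the environment via tool calls,
# keeps a memory of observations, and tries to infer the hidden rule.
from dataclasses import dataclass, field
from functools import reduce
from math import gcd
from typing import List, Optional, Tuple
import random


@dataclass
class HiddenRuleEnv:
    # Toy environment: a tool call "succeeds" only when the input is divisible
    # by a secret modulus, which the agent must discover.
    secret_modulus: int = 7
    tool_calls: int = 0

    def call_tool(self, x: int) -> str:
        self.tool_calls += 1
        return "success" if x % self.secret_modulus == 0 else "failure"


@dataclass
class ExplorerAgent:
    # Placeholder for an LLM agent: it only records (input, observation) pairs
    # and guesses the rule; a real agent would plan and reason over this memory.
    memory: List[Tuple[int, str]] = field(default_factory=list)

    def propose_action(self) -> int:
        return random.randint(1, 50)

    def update(self, x: int, obs: str) -> None:
        self.memory.append((x, obs))

    def hypothesis(self) -> Optional[int]:
        # Guess the rule as the GCD of all inputs observed to succeed so far.
        successes = [x for x, obs in self.memory if obs == "success"]
        return reduce(gcd, successes) if len(successes) >= 2 else None


def run_episode(max_tool_calls: int = 60) -> Optional[int]:
    env, agent = HiddenRuleEnv(), ExplorerAgent()
    while env.tool_calls < max_tool_calls:
        x = agent.propose_action()
        agent.update(x, env.call_tool(x))
        guess = agent.hypothesis()
        if guess == env.secret_modulus:  # scoring check only; the agent never sees the secret
            return guess
    return None


if __name__ == "__main__":
    print("recovered rule:", run_episode())

The real benchmark replaces this toy rule with three rich environments and budgets of hundreds of tool calls, but the control flow it exercises (probe, observe, update memory, revise hypotheses over a long horizon) is the capability the abstract argues current LLM agents struggle to sustain.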