

UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

September 26, 2025
作者: Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin, Wenjie Lu, Guozheng Ma, Haiying He, Yingsha Xie, Qiyang Zhou, Zixuan Hu, Hongze Mi, Yibo Wang, Naiqiang Tan, Hong Chen, Yi R. Fung, Chun Yuan, Li Shen
cs.AI

Abstract

Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are placed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tool management, and interaction with their environments. Under the heaviest scale setting, trajectories average 200k+ tokens and 400+ tool calls, whereas in standard configurations they still exceed 35k tokens and involve more than 60 tool calls on average. Our extensive experiments reveal that LLM agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails on our tasks. To better illustrate these failures, we conduct an in-depth analysis of the collected trajectories, identify eight types of errors, and attribute them to two primary causes: in-context locking and gaps in fundamental functional capabilities. Our code will be available at https://github.com/StarDewXXX/UltraHorizon.
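To make the setting concrete, the sketch below illustrates the kind of long-horizon exploration loop the abstract describes: an agent repeatedly reasons over partial observations, maintains external memory, and issues tool calls to uncover hidden rules. This is a minimal illustration only; `env`, `llm`, `tools`, `Memory`, and the decision fields are assumed, hypothetical interfaces, not the benchmark's actual API.

```python
# Hypothetical sketch of a long-horizon exploration loop (illustrative only;
# none of these names come from the UltraHorizon codebase).
from dataclasses import dataclass, field


@dataclass
class Memory:
    """External notes an agent maintains across a very long trajectory."""
    notes: list[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        self.notes.append(note)

    def recent(self, last_n: int = 20) -> str:
        return "\n".join(self.notes[-last_n:])


def explore(env, llm, tools, max_steps: int = 400) -> list[str]:
    """Iteratively probe a partially observable environment to uncover its hidden rules."""
    memory = Memory()
    observation = env.reset()             # only a partial view of the environment
    hypotheses: list[str] = []

    for step in range(max_steps):         # budget mirrors the 400+ tool-call scale
        # Reason over the latest observation plus accumulated notes, then pick a tool call.
        decision = llm.reason(observation=observation, notes=memory.recent())
        memory.write(f"step {step}: {decision.summary}")

        if decision.new_hypothesis:       # record a candidate hidden rule
            hypotheses.append(decision.new_hypothesis)
        if decision.done:                 # agent believes the rules are uncovered
            break

        # Interact with the environment through the chosen tool and observe the result.
        observation = tools[decision.tool_name](**decision.tool_args)

    return hypotheses
```

In such a loop, the capabilities the benchmark targets map directly onto the code: sustained reasoning and planning (the `reason` step), tool management (the tool dispatch), and memory management (the running notes that must remain useful over hundreds of steps).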