

UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

September 26, 2025
Authors: Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin, Wenjie Lu, Guozheng Ma, Haiying He, Yingsha Xie, Qiyang Zhou, Zixuan Hu, Hongze Mi, Yibo Wang, Naiqiang Tan, Hong Chen, Yi R. Fung, Chun Yuan, Li Shen
cs.AI

Abstract

Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are placed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tool management, and interaction with their environments. Under the heaviest scale setting, trajectories average more than 200k tokens and 400 tool calls, while even standard configurations average more than 35k tokens and 60 tool calls. Our extensive experiments reveal that LLM agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails on our tasks. To better illustrate how agents fail, we conduct an in-depth analysis of the collected trajectories, identifying eight error types and attributing them to two primary causes: in-context locking and gaps in fundamental functional capabilities. Our code will be available at https://github.com/StarDewXXX/UltraHorizon.
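
To make the setup concrete, below is a minimal, hypothetical sketch of the kind of exploration loop the abstract describes: an agent repeatedly probes an environment through tool calls, stores observations as memory, and tries to infer a hidden rule within a tool-call budget. Every name here (HiddenRuleEnv, ExplorerAgent, the secret-modulus rule, run_episode) is an illustrative assumption for exposition, not the benchmark's actual environments or API.

# A minimal, illustrative exploration loop (assumed interfaces, not the
# benchmark's actual API): the agent probes the environment via tool calls,
# keeps a memory of observations, and tries to infer the hidden rule.
from dataclasses import dataclass, field
from functools import reduce
from math import gcd
from typing import List, Optional, Tuple
import random


@dataclass
class HiddenRuleEnv:
    # Toy environment: a tool call "succeeds" only when the input is divisible
    # by a secret modulus, which the agent must discover.
    secret_modulus: int = 7
    tool_calls: int = 0

    def call_tool(self, x: int) -> str:
        self.tool_calls += 1
        return "success" if x % self.secret_modulus == 0 else "failure"


@dataclass
class ExplorerAgent:
    # Placeholder for an LLM agent: it only records (input, observation) pairs
    # and guesses the rule; a real agent would plan and reason over this memory.
    memory: List[Tuple[int, str]] = field(default_factory=list)

    def propose_action(self) -> int:
        return random.randint(1, 50)

    def update(self, x: int, obs: str) -> None:
        self.memory.append((x, obs))

    def hypothesis(self) -> Optional[int]:
        # Guess the rule as the GCD of all inputs observed to succeed so far.
        successes = [x for x, obs in self.memory if obs == "success"]
        return reduce(gcd, successes) if len(successes) >= 2 else None


def run_episode(max_tool_calls: int = 60) -> Optional[int]:
    env, agent = HiddenRuleEnv(), ExplorerAgent()
    while env.tool_calls < max_tool_calls:
        x = agent.propose_action()
        agent.update(x, env.call_tool(x))
        guess = agent.hypothesis()
        if guess == env.secret_modulus:  # scoring check only; the agent never sees the secret
            return guess
    return None


if __name__ == "__main__":
    print("recovered rule:", run_episode())

The real benchmark replaces this toy rule with three rich environments and budgets of hundreds of tool calls, but the control flow it exercises (probe, observe, update memory, revise hypotheses over a long horizon) is the capability the abstract argues current LLM agents struggle to sustain.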