UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

Abstract

Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon, partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are placed in long-horizon discovery tasks in which they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tool management, and interaction with the environment. In the largest-scale setting, trajectories average more than 200k tokens and more than 400 tool calls; even in standard configurations they average more than 35k tokens and more than 60 tool calls. Our extensive experiments reveal that LLM agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails on our tasks. To better characterize agent failures, we conduct an in-depth analysis of the collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and functional fundamental capability gaps. Our code will be available at https://github.com/StarDewXXX/UltraHorizon.
