UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

Abstract

Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are placed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tool management, and interaction with the environment. Under the heaviest scale setting, trajectories average more than 200k tokens and more than 400 tool calls, whereas in standard configurations they still average more than 35k tokens and more than 60 tool calls. Our extensive experiments reveal that LLM agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails on our tasks. To better illustrate why agents fail, we conduct an in-depth analysis of the collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and functional fundamental capability gaps. Our code will be available at https://github.com/StarDewXXX/UltraHorizon.
