FutureSim: Replaying World Events to Evaluate Adaptive Agents

Abstract

AI agents are increasingly deployed in dynamic, open-ended environments that require them to adapt to new information as it arrives. To measure this capability efficiently in realistic use cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, in which agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arrive and questions resolve over the simulated period. We evaluate frontier agents in their native harnesses, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in capabilities: the best agent reaches 25% accuracy, and many agents obtain a Brier skill score worse than that of making no prediction at all. Through careful ablations, we show that FutureSim offers a realistic setting for studying emerging research directions such as long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way for measuring AI progress on open-ended adaptation over long time horizons in the real world.
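
To make the Brier skill score comparison concrete, here is a minimal sketch. The abstract does not define the score, so this assumes the standard textbook form, BSS = 1 - BS / BS_ref, on binary questions, with a constant 0.5 forecast standing in for the reference. The function names and the choice of reference are hypothetical and may differ from FutureSim's actual scoring.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    reference_prob: float = 0.5,  # assumption: uninformed constant forecast
) -> float:
    """Standard skill score: 1 - BS / BS_ref.

    Positive values beat the reference forecast, 0 matches it,
    and negative values are worse than the reference.
    """
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; skill score undefined")
    return 1.0 - bs / bs_ref


# Illustrative, made-up outcomes for four resolved binary questions.
outcomes = [1, 0, 1, 0]
calibrated = brier_skill_score([0.9, 0.1, 0.9, 0.1], outcomes)
overconfident_wrong = brier_skill_score([0.1, 0.9, 0.1, 0.9], outcomes)
print(f"confident and right: {calibrated:.2f}")
print(f"confident and wrong: {overconfident_wrong:.2f}")
```

Under these assumptions, a forecaster that is confidently wrong gets a negative skill score, which is one way an agent can score worse than a baseline that commits to nothing.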