FutureSim: Evaluating Adaptive Agents by Replaying World Events

Abstract

AI agents are increasingly deployed in dynamic, open-ended environments where they must adapt to new information as it arrives. To measure this capability efficiently in realistic use cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, in which agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arrive and questions resolve over the simulated period. We evaluate frontier agents in their native harnesses, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities: the best agent reaches 25% accuracy, and many agents obtain a worse Brier skill score than they would by making no prediction at all. Through careful ablations, we show that FutureSim offers a realistic setting for studying emerging research directions such as long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way for measuring AI progress on open-ended adaptation over long time horizons in the real world.
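
For readers unfamiliar with the metric, the following is a minimal sketch of the textbook Brier skill score, BSS = 1 - BS / BS_ref, for binary questions. It assumes 0/1 outcomes and uses a constant 0.5 forecast as a stand-in for "making no prediction"; FutureSim's actual scoring rule and baseline are not specified in this abstract and may differ. The function names and the sample forecasts are hypothetical and purely illustrative.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    reference_prob: float = 0.5,  # assumption: uninformative baseline forecast
) -> float:
    """Return 1 - BS / BS_ref; positive beats the reference, negative is worse."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; skill score is undefined")
    return 1.0 - bs / bs_ref


# Illustrative data only, not taken from the benchmark.
outcomes = [1, 0, 0, 1]
calibrated = [0.8, 0.2, 0.1, 0.7]      # leans the right way on every question
overconfident = [0.1, 0.9, 0.8, 0.2]   # confidently wrong on every question

print(brier_skill_score(calibrated, outcomes))
print(brier_skill_score(overconfident, outcomes))
```

Under these assumptions, a forecaster that is confidently wrong receives a negative skill score, which is the sense in which an agent can do worse than the uninformative baseline.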