FutureSim: Replaying World Events to Evaluate Adaptive Agents

Abstract

AI agents are increasingly being deployed in dynamic, open-ended environments that require them to adapt to new information as it arrives. To efficiently measure this capability in realistic use cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, in which agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arrive and questions resolve over the simulated period. We evaluate frontier agents in their native harnesses, testing their ability to predict world events over the three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities: the best agent reaches only 25% accuracy, and many agents achieve a worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting for studying emerging research directions such as long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measuring AI progress on open-ended adaptation over long time horizons in the real world.
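The abstract does not spell out how its Brier skill score is computed. As a rough illustration only, the sketch below uses the standard textbook form for binary questions, BSS = 1 - BS / BS_ref, where BS is the mean squared error of the forecast probabilities. The function names and the constant 0.5 reference forecast are assumptions made here, not FutureSim's scoring code, and the benchmark's treatment of open-ended questions may differ. Under this form, a score below 0 means the forecaster did worse than the uninformative baseline, which is the sense in which an agent can be "worse than making no prediction at all".

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    reference_prob: float = 0.5,  # assumed uninformative baseline forecast
) -> float:
    """Standard skill score: 1 - BS / BS_ref (higher is better, 0 = baseline)."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; skill score undefined")
    return 1.0 - bs / bs_ref


# Hypothetical resolved questions (1 = event happened, 0 = it did not).
outcomes = [1, 0, 1, 0]
calibrated = [0.9, 0.1, 0.8, 0.2]      # confident and mostly right
overconfident = [0.1, 0.9, 0.2, 0.8]   # confident and mostly wrong

good_bss = brier_skill_score(calibrated, outcomes)
bad_bss = brier_skill_score(overconfident, outcomes)
baseline_bss = brier_skill_score([0.5] * 4, outcomes)
```

With these made-up forecasts, the well-calibrated forecaster scores above 0 and the overconfident one scores below 0, while the baseline itself scores exactly 0. Confidently wrong predictions are penalized more heavily than hedged ones, so an agent can have nonzero accuracy and still end up with negative skill.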