FutureSim: Replaying World Events to Evaluate Adaptive Agents

May 14, 2026
Authors: Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping
cs.AI

Abstract

AI agents are increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arrive and questions resolve over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities: the best agent reaches an accuracy of 25%, and many agents have a worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time horizons in the real world.
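
The abstract does not define the Brier skill score, so the sketch below is only an illustration of how a forecaster can score worse than an uninformative baseline. It assumes the common textbook form for binary questions, BSS = 1 - BS / BS_ref, with a constant 0.5 forecast standing in for "no prediction"; the paper's actual metric, baseline, and question format may differ. The function names and the sample numbers are hypothetical.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(
    probs: Sequence[float],
    outcomes: Sequence[int],
    reference_probs: Sequence[float],
) -> float:
    """Skill relative to a reference forecast (assumed form: 1 - BS / BS_ref).

    Positive values beat the reference, zero matches it, and negative
    values are worse than the reference.
    """
    bs_ref = brier_score(reference_probs, outcomes)
    if bs_ref == 0:
        raise ValueError("reference forecast is perfect; skill is undefined")
    return 1.0 - brier_score(probs, outcomes) / bs_ref


# Hypothetical data: four resolved yes/no questions.
outcomes = [1, 0, 0, 1]
# Assumed stand-in for "no prediction": always answer 0.5.
baseline = [0.5] * len(outcomes)

well_calibrated = [0.8, 0.2, 0.1, 0.9]
confidently_wrong = [0.1, 0.9, 0.8, 0.2]

good_bss = brier_skill_score(well_calibrated, outcomes, baseline)
bad_bss = brier_skill_score(confidently_wrong, outcomes, baseline)
print(f"well calibrated:   {good_bss:+.2f}")
print(f"confidently wrong: {bad_bss:+.2f}")
```

Under these assumptions, a confidently wrong forecaster ends up with a negative skill score, which is the sense in which an agent can do worse than abstaining.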