
World-in-World: World Models in a Closed-Loop World

October 20, 2025
作者: Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, Jieneng Chen
cs.AI

Abstract

Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs to be used for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success: controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
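The closed-loop protocol the abstract contrasts with open-loop evaluation can be sketched in a few lines: the agent uses the world model only to score candidate actions, but then commits the chosen action to the real environment and replans from the actual observation. The sketch below is purely illustrative; `ToyWorldModel`, `ToyEnv`, and `closed_loop_episode` are hypothetical stand-ins, not part of the World-in-World API.

```python
# Minimal closed-loop planning sketch (illustrative only; all names
# here are hypothetical, not the World-in-World platform's actual API).
class ToyWorldModel:
    """Scores an action by an imagined outcome; a real WM would roll
    out predicted observations instead of this toy heuristic."""
    def predict_score(self, state, action):
        # Prefer actions that move the state toward the goal value 10.
        return -abs((state + action) - 10)

class ToyEnv:
    """Stand-in environment with an integer state and goal of 10."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        self.state += action
        done = (self.state == 10)
        return self.state, done

def closed_loop_episode(wm, env, actions=(-1, 1, 2), max_steps=20):
    """Closed loop: plan with the WM, act for real, observe, replan.
    An open-loop benchmark would stop at scoring the WM's predictions;
    here task success is what gets measured."""
    state, done, steps = env.state, False, 0
    while not done and steps < max_steps:
        action = max(actions, key=lambda a: wm.predict_score(state, a))
        state, done = env.step(action)  # real feedback closes the loop
        steps += 1
    return done, steps

success, steps = closed_loop_episode(ToyWorldModel(), ToyEnv())
```

With this toy setup the planner repeatedly picks the action `2` and reaches the goal in five steps; the point is only the loop structure (plan, act, observe, replan), which is what distinguishes the paper's closed-loop task-success metric from open-loop visual-quality scoring.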