World-in-World: World Models in a Closed-Loop World

October 20, 2025
Authors: Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M. Patel, Paul Pu Liang, Daniel Khashabi, Cheng Peng, Rama Chellappa, Tianmin Shu, Alan Yuille, Yilun Du, Jieneng Chen
cs.AI

Abstract

Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved: do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-in-World, the first open platform that benchmarks WMs in a closed-loop world that mirrors real agent-environment interactions. World-in-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs to be used for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success: controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance.
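
The closed-loop protocol the abstract describes can be made concrete with a short sketch. The Python below is a minimal illustration under stated assumptions, not the World-in-World API: `plan_with_wm`, `closed_loop_episode`, and the toy stubs are all hypothetical names, and sampling-based rollout scoring is only one plausible instance of an "online planning strategy." Note that the episode returns task success rather than any visual-quality score, and that `num_candidates` and `horizon` are natural knobs for spending extra inference-time compute.

```python
"""Illustrative sketch of closed-loop world-model evaluation.

Hypothetical names throughout; this mirrors the abstract's protocol,
not the actual World-in-World implementation.
"""
import random
from typing import Sequence


def plan_with_wm(wm, obs, action_space: Sequence[int],
                 horizon: int = 4, num_candidates: int = 16) -> int:
    """Online planning by sampling: the WM imagines each candidate action
    sequence from the current observation, and we keep the first action of
    the highest-scoring rollout. Raising num_candidates or horizon is one
    way to allocate more inference-time compute."""
    best_score, best_action = float("-inf"), action_space[0]
    for _ in range(num_candidates):
        candidate = [random.choice(action_space) for _ in range(horizon)]
        score = wm.rollout(obs, candidate)  # predictive perception
        if score > best_score:
            best_score, best_action = score, candidate[0]
    return best_action


def closed_loop_episode(env, wm, action_space: Sequence[int],
                        max_steps: int = 50) -> bool:
    """One agent-environment episode; the metric is binary task success,
    not the visual quality of the WM's imagined frames."""
    obs = env.reset()
    for _ in range(max_steps):
        action = plan_with_wm(wm, obs, action_space)
        obs, done, success = env.step(action)  # real feedback closes the loop
        if done:
            return success
    return False


if __name__ == "__main__":
    # Toy stubs so the sketch runs end to end: a 1-D "reach the goal" world
    # and a WM that scores rollouts by imagined proximity to the goal.
    class ToyEnv:
        def reset(self):
            self.pos = 0
            return self.pos

        def step(self, action):
            self.pos += action
            done = abs(self.pos) >= 5
            return self.pos, done, self.pos >= 5

    class ToyWM:
        def rollout(self, obs, actions):
            return -abs(5 - (obs + sum(actions)))  # closer to goal = better

    print("task success:", closed_loop_episode(ToyEnv(), ToyWM(), [-1, 1]))
```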