World Simulation with Video Foundation Models for Physical AI
October 28, 2025
Authors: NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta, Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz, Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang, Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren, Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen, Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, Andrew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang, Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng, Andrew Zhu, Yuke Zhu
cs.AI
Abstract
We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World
Foundation Models for Physical AI. Built on a flow-based architecture,
[Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation
in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language
model, to provide richer text grounding and finer control of world simulation.
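As context for what "flow-based" means in practice, below is a minimal sketch of the flow-matching objective commonly used to train such generators; `model`, `x1`, and `text_emb` are illustrative names under that assumption, not the Cosmos-Predict2.5 API.

```python
# Illustrative flow-matching objective for a flow-based generator: the
# network regresses the velocity that transports noise to data along
# straight-line paths. Hypothetical names, not the Cosmos-Predict2.5 code.
import torch

def flow_matching_loss(model, x1, text_emb):
    """x1: clean video latents [B, C, T, H, W]; text_emb: conditioning
    embeddings (e.g., grounding from a vision-language model)."""
    x0 = torch.randn_like(x1)                      # Gaussian noise source
    t = torch.rand(x1.shape[0], device=x1.device)  # one time per sample
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))      # broadcast over latent dims
    xt = (1.0 - t_b) * x0 + t_b * x1               # point on the noise-to-data path
    v_target = x1 - x0                             # velocity of that path
    v_pred = model(xt, t, text_emb)                # predicted velocity field
    return torch.mean((v_pred - v_target) ** 2)
```

Sampling then integrates the learned velocity field from noise to data (e.g., with a few Euler steps); in a unified model of this kind, Text2World, Image2World, and Video2World can differ only in what is supplied as conditioning (a text prompt, a first frame, or past frames).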
Trained on 200M curated video clips and refined with reinforcement
learning-based post-training, [Cosmos-Predict2.5] achieves substantial
improvements over [Cosmos-Predict1] in video quality and instruction alignment,
with models released at 2B and 14B scales.
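The abstract does not specify which reinforcement-learning method is used; one generic pattern for reward-based post-training of a generative model, shown purely as an illustration, is to re-weight the supervised loss by a reward computed on the model's own samples. Here `generate` and `reward_model` are hypothetical helpers, and `flow_matching_loss` is the sketch above.

```python
# Generic reward-weighted post-training step (one simple pattern, not
# necessarily the paper's method): sample from the model, score the
# samples, and reinforce high-reward generations.
import torch

def reward_weighted_step(model, optimizer, prompts, generate, reward_model):
    videos, cond = generate(model, prompts)        # roll out the current model
    with torch.no_grad():
        rewards = reward_model(videos, prompts)    # scalar score per sample
        weights = torch.softmax(rewards, dim=0)    # normalize across the batch
    per_sample = torch.stack([                     # supervised loss per sample
        flow_matching_loss(model, v.unsqueeze(0), c.unsqueeze(0))
        for v, c in zip(videos, cond)
    ])
    loss = (weights * per_sample).sum()            # up-weight high-reward samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```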
These capabilities enable more reliable synthetic data generation, policy
evaluation, and closed-loop simulation for robotics and autonomous systems.
We further extend the family
with [Cosmos-Transfer2.5], a ControlNet-style framework for Sim2Real and
Real2Real world translation. Despite being 3.5× smaller than
[Cosmos-Transfer1], it delivers higher-fidelity and more robust long-horizon
video generation. Together, these advances establish [Cosmos-Predict2.5] and
[Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence.
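For readers unfamiliar with the ControlNet pattern the framework builds on, the sketch below shows its core idea, with hypothetical names and shapes rather than the Cosmos-Transfer2.5 code: a trainable copy of a base block encodes a control signal and injects it into the frozen base model through zero-initialized projections.

```python
# Minimal ControlNet-style conditioning: at initialization the zero
# projection makes the branch a no-op, so training starts exactly from
# the frozen base model's behavior. Hypothetical names and shapes.
import copy
import torch.nn as nn

class ControlBranch(nn.Module):
    def __init__(self, base_block: nn.Module, channels: int):
        super().__init__()
        self.block = copy.deepcopy(base_block)      # trainable control encoder
        self.zero_proj = nn.Conv3d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)       # zero init => no-op at start
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, h_base, control):
        h_ctrl = self.block(control)                # encode control video features
        return h_base + self.zero_proj(h_ctrl)      # residual injection into base
```

In a setup like this, Sim2Real translation conditions generation on control videos rendered from a simulator (e.g., depth or segmentation), while Real2Real extracts the same kinds of controls from real footage.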
To accelerate research and deployment in Physical AI, we release source code,
pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model
License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and
https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open
resources lower the barrier to adoption and foster innovation in building the
next generation of embodied intelligence.