World Simulation with Video Foundation Models for Physical AI

October 28, 2025
作者: NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jinwei Gu, Aryaman Gupta, Siddharth Gururani, Imad El Hanafi, Ali Hassani, Zekun Hao, Jacob Huffman, Joel Jang, Pooya Jannaty, Jan Kautz, Grace Lam, Xuan Li, Zhaoshuo Li, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Yen-Chen Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Seungjun Nah, Yashraj Narang, Abhijeet Panaskar, Lindsey Pavao, Trung Pham, Morteza Ramezanali, Fitsum Reda, Scott Reed, Xuanchi Ren, Haonan Shao, Yue Shen, Stella Shi, Shuran Song, Bartosz Stefaniak, Shangkun Sun, Shitao Tang, Sameena Tasmeen, Lyne Tchapmi, Wei-Cheng Tseng, Jibin Varghese, Andrew Z. Wang, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Jiashu Xu, Dinghao Yang, Xiaodong Yang, Haotian Ye, Seonghyeon Ye, Xiaohui Zeng, Jing Zhang, Qinsheng Zhang, Kaiwen Zheng, Andrew Zhu, Yuke Zhu
cs.AI

Abstract

We introduce Cosmos-Predict2.5, the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation in a single model and leverages Cosmos-Reason1, a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, Cosmos-Predict2.5 achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with Cosmos-Transfer2.5, a ControlNet-style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than Cosmos-Transfer1, it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish Cosmos-Predict2.5 and Cosmos-Transfer2.5 as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.
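
The abstract states that Cosmos-Predict2.5 is built on a flow-based architecture but gives no implementation details. As a rough illustration of what that family of objectives looks like, here is a minimal rectified-flow (flow-matching) training and sampling sketch in PyTorch. Everything in it is an illustrative stand-in, not the paper's code: `VelocityNet`, `flow_matching_loss`, and the toy 64-dimensional latents are hypothetical, whereas the real model would operate on video latents conditioned on text embeddings from Cosmos-Reason1.

```python
import torch
import torch.nn as nn

# Toy velocity-field network standing in for the real video backbone.
class VelocityNet(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the flow time t by simple concatenation.
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

def flow_matching_loss(model: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """Rectified-flow objective: regress the straight-line velocity x1 - x0."""
    x0 = torch.randn_like(x1)                       # noise endpoint (t = 0)
    t = torch.rand(x1.shape[0])                     # uniform flow time in [0, 1]
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1   # linear interpolant
    v_target = x1 - x0                              # ground-truth velocity
    return ((model(x_t, t) - v_target) ** 2).mean()

@torch.no_grad()
def sample(model: nn.Module, shape, steps: int = 50) -> torch.Tensor:
    # Euler integration of dx/dt = v(x, t) from noise (t=0) to data (t=1).
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + model(x, t) * dt
    return x

model = VelocityNet()
x1 = torch.randn(8, 64)                             # stand-in for clean latents
loss = flow_matching_loss(model, x1)
loss.backward()
```

At inference time, generation amounts to integrating the learned velocity field from noise toward data, as in `sample` above; the unified Text2World, Image2World, and Video2World modes described in the abstract would differ in what conditioning is supplied to that velocity field, a detail the abstract does not specify.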