Cosmos 3: Omnimodal World Models for Physical AI

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 unifies the critical modalities for Physical AI, subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation shows that Cosmos 3 establishes a new state of the art across a diverse suite of understanding and generation tasks, indicating that omnimodal world models can serve as scalable, general-purpose backbones for embodied agents. At the time the technical report was written, our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and as the best policy model by RoboArena. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License (https://openmdw.ai/license/1-1/) at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.