Cosmos 3: Omnimodal World Models for Physical AI

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 unifies the critical modalities for Physical AI, effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation shows that Cosmos 3 establishes a new state of the art across a diverse suite of understanding and generation tasks, indicating that omnimodal world models can serve as scalable, general-purpose backbones for embodied agents. At the time the technical report was written, our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and as the best policy model by RoboArena. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License (https://openmdw.ai/license/1-1/) at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is https://research.nvidia.com/labs/cosmos-lab/cosmos3.