
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

March 18, 2025
Authors: NVIDIA, Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin, Yen-Chen Lin, Ming-Yu Liu, Andrew Mathau, Yun Ni, Lindsey Pavao, Wei Ping, David W. Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z. Wang, Boxin Wang, Haoxiang Wang, Fangyin Wei, Jiashu Xu, Yao Xu, Xiaodong Yang, Zhuolin Yang, Xiaohui Zeng, Zhe Zhang
cs.AI

Abstract

Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models, which can understand the physical world and generate appropriate embodied decisions (e.g., the next-step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning. To represent physical common sense, we use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, we rely on a two-dimensional ontology that generalizes across different physical embodiments. Building on these capabilities, we develop two multimodal large language models, Cosmos-Reason1-8B and Cosmos-Reason1-56B. We curate data and train our models in four stages: vision pre-training, general supervised fine-tuning (SFT), Physical AI SFT, and Physical AI reinforcement learning (RL) as post-training. To evaluate our models, we build comprehensive benchmarks for physical common sense and embodied reasoning according to our ontologies. Evaluation results show that Physical AI SFT and reinforcement learning bring significant improvements. To facilitate the development of Physical AI, we will make our code and pre-trained models available under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-reason1.
