

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

January 22, 2026
Authors: Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, Jinwei Gu
cs.AI

Abstract

Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/
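The abstract's test-time planning step — sampling candidate action trajectories and executing the one with the highest predicted value (expected cumulative reward) — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the 7-dimensional action chunk, and the random stand-in for the diffusion model's decoded value are all hypothetical; in Cosmos Policy the actions and values are decoded from latent frames of the video diffusion model.

```python
import random

def sample_candidate(rng):
    # Hypothetical stand-in for one diffusion rollout: in the real system,
    # the video model denoises latent frames that decode into an action
    # trajectory and a predicted value (expected cumulative reward).
    actions = [rng.uniform(-1.0, 1.0) for _ in range(7)]  # e.g. a 7-DoF action
    value = rng.random()  # stand-in for the decoded value estimate
    return actions, value

def plan(num_candidates=8, seed=0):
    """Model-based planning at test time: sample several candidate action
    trajectories and return the one whose predicted value is highest."""
    rng = random.Random(seed)
    candidates = [sample_candidate(rng) for _ in range(num_candidates)]
    best_actions, best_value = max(candidates, key=lambda c: c[1])
    return best_actions, best_value
```

The design point this illustrates is that the same model serves as policy (proposing actions), world model (predicting future states), and value function (scoring candidates), so planning needs no extra architectural components.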