VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
December 7, 2025
Authors: Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, Baining Guo
cs.AI
Abstract
Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy - forecasting both actions and their visual consequences - explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.
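The abstract describes a multi-modal Diffusion Transformer that, given a language instruction and a current image, jointly denoises future video frames and an action sequence. The sketch below is a minimal, hypothetical PyTorch illustration of what such a dual-prediction interface could look like; it is not the authors' released code, and all module names, token layout, and dimensions (e.g. `VideoVLAPolicy`, a single shared Transformer encoder, 7-DoF actions) are assumptions made for illustration.

```python
# Hypothetical sketch of a VideoVLA-style dual-prediction denoiser (illustrative, not the paper's code).
# One denoising step: noisy future-video latents and a noisy action chunk are placed in the same
# token sequence, conditioned on the current image latents, a language embedding, and the timestep.
import torch
import torch.nn as nn

class VideoVLAPolicy(nn.Module):
    def __init__(self, latent_dim=64, act_dim=7, d_model=256):
        super().__init__()
        self.video_in = nn.Linear(latent_dim, d_model)   # noisy future-frame latent tokens
        self.act_in = nn.Linear(act_dim, d_model)        # noisy action tokens
        self.lang_in = nn.Linear(512, d_model)           # pooled text embedding (e.g. from a frozen encoder)
        self.img_in = nn.Linear(latent_dim, d_model)     # current-image latent tokens
        self.t_in = nn.Sequential(                       # diffusion-timestep embedding
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
        self.video_out = nn.Linear(d_model, latent_dim)  # predicted clean video latents
        self.act_out = nn.Linear(d_model, act_dim)       # predicted clean actions

    def forward(self, noisy_video, noisy_actions, lang_emb, img_latents, t):
        # All modalities share one attention sequence, so action tokens can attend
        # to the imagined future frames while they are being denoised together.
        t_tok = self.t_in(t[:, None]).unsqueeze(1)
        tokens = torch.cat([
            t_tok,
            self.img_in(img_latents),
            self.lang_in(lang_emb).unsqueeze(1),
            self.video_in(noisy_video),
            self.act_in(noisy_actions),
        ], dim=1)
        h = self.backbone(tokens)
        n_img, n_vid = img_latents.shape[1], noisy_video.shape[1]
        video_pred = self.video_out(h[:, 2 + n_img : 2 + n_img + n_vid])
        action_pred = self.act_out(h[:, 2 + n_img + n_vid :])
        return video_pred, action_pred

# Toy usage: batch of 2, 8 future-frame latents, a 16-step action chunk.
# In practice this forward pass would sit inside an iterative diffusion sampling loop.
model = VideoVLAPolicy()
video_pred, action_pred = model(
    noisy_video=torch.randn(2, 8, 64),
    noisy_actions=torch.randn(2, 16, 7),
    lang_emb=torch.randn(2, 512),
    img_latents=torch.randn(2, 4, 64),
    t=torch.rand(2),
)
print(video_pred.shape, action_pred.shape)  # torch.Size([2, 8, 64]) torch.Size([2, 16, 7])
```

Keeping the action tokens in the same attention sequence as the imagined future-frame latents is one plausible way to realize the abstract's observation that high-quality visual imagination correlates with reliable action prediction; the actual architecture and conditioning scheme are specified in the paper itself.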