VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
December 7, 2025
Authors: Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, Baining Guo
cs.AI
Abstract
Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an initial image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models the video, language, and action modalities, leveraging pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy, forecasting both actions and their visual consequences, points toward a paradigm shift in robot learning and unlocks new generalization capabilities in manipulation systems.
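To make the joint video-and-action modeling concrete, the sketch below shows one way a multi-modal Diffusion Transformer could denoise future video latents and an action chunk in a single token stream conditioned on language tokens. This is a minimal illustration under stated assumptions, not the paper's implementation: all module names, token layouts, dimensions (video_latent_dim, action_dim, text_dim), and the diffusion parameterization are hypothetical.

```python
# Minimal sketch (not the paper's code): a DiT-style denoiser that attends jointly
# over language, future-video-latent, and action tokens, and predicts per-modality
# denoising targets. Shapes and names are illustrative assumptions.
import torch
import torch.nn as nn


class MultiModalDiTBlock(nn.Module):
    """One transformer block with full self-attention over all modality tokens."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class VideoActionDenoiser(nn.Module):
    """Jointly predicts denoising targets for future video latents and an action sequence."""

    def __init__(self, dim=512, depth=4, video_latent_dim=16, action_dim=7, text_dim=768):
        super().__init__()
        self.video_in = nn.Linear(video_latent_dim, dim)    # project video latents to model width
        self.action_in = nn.Linear(action_dim, dim)         # project noisy action chunk
        self.text_in = nn.Linear(text_dim, dim)              # project language-instruction embeddings
        self.time_emb = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.blocks = nn.ModuleList(MultiModalDiTBlock(dim) for _ in range(depth))
        self.video_out = nn.Linear(dim, video_latent_dim)
        self.action_out = nn.Linear(dim, action_dim)

    def forward(self, noisy_video, noisy_actions, text_tokens, t):
        # noisy_video:   (B, Nv, video_latent_dim) flattened future-frame latents
        # noisy_actions: (B, Na, action_dim) noisy action chunk
        # text_tokens:   (B, Nt, text_dim) language-instruction embeddings
        # t:             (B, 1) diffusion timestep
        v = self.video_in(noisy_video)
        a = self.action_in(noisy_actions)
        c = self.text_in(text_tokens)
        tok = torch.cat([c, v, a], dim=1) + self.time_emb(t).unsqueeze(1)
        for blk in self.blocks:
            tok = blk(tok)
        nv, na = noisy_video.shape[1], noisy_actions.shape[1]
        video_pred = self.video_out(tok[:, c.shape[1]:c.shape[1] + nv])
        action_pred = self.action_out(tok[:, -na:])
        return video_pred, action_pred  # per-modality denoising predictions
```

In a full system of this kind, inference would iteratively denoise both streams from noise, execute only the decoded action chunk, and treat the decoded video latents as the "imagined future"; the abstract's observation is that the quality of this imagined future correlates with action reliability and task success.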