
RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

September 18, 2025
Authors: Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
cs.AI

Abstract

This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
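To make the ActionVAE idea concrete, below is a minimal PyTorch sketch of a variational autoencoder that compresses a fixed-length chunk of actions into a single compact latent vector. The abstract only states that action sequences are encoded into latent embeddings; the MLP encoder/decoder, the chunk length, the action dimensionality, and the latent size used here are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of an ActionVAE-style model: compress an action chunk into one latent
# embedding that a VLA head could predict instead of raw low-level actions.
# Layer sizes and dimensions below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionVAE(nn.Module):
    def __init__(self, action_dim=7, chunk_len=16, latent_dim=32, hidden_dim=256):
        super().__init__()
        in_dim = action_dim * chunk_len
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
        )
        self.action_dim, self.chunk_len = action_dim, chunk_len

    def encode(self, actions):
        # actions: (batch, chunk_len, action_dim) -> mean/log-variance of the latent
        h = self.encoder(actions.flatten(1))
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z):
        # latent embedding -> reconstructed action chunk
        return self.decoder(z).view(-1, self.chunk_len, self.action_dim)

    def forward(self, actions):
        mu, logvar = self.encode(actions)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        recon = self.decode(z)
        # Standard VAE objective: reconstruction term + KL regularizer.
        recon_loss = F.mse_loss(recon, actions)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, mu, recon_loss, kl


if __name__ == "__main__":
    vae = ActionVAE()
    actions = torch.randn(4, 16, 7)                # a batch of 4 action chunks
    recon, mu, recon_loss, kl = vae(actions)       # mu is the compact embedding
    print(recon.shape, mu.shape, recon_loss.item(), kl.item())
```

In this reading, the downstream VLA model would regress the compact latent (here `mu`) rather than a long raw action sequence, and the ActionVAE decoder would map that prediction back to executable actions, which is consistent with the abstract's claim of reducing the complexity of the VLA output space.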