
RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

September 18, 2025
Authors: Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
cs.AI

Abstract

This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
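To make the ActionVAE idea concrete, below is a minimal PyTorch sketch of a variational autoencoder that compresses a fixed-length chunk of actions into a single compact latent vector. The abstract only states that action sequences are encoded into latent embeddings; the MLP encoder/decoder, the chunk length, the action dimensionality, and the latent size used here are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of an ActionVAE-style model: compress an action chunk into one latent
# embedding that a VLA head could predict instead of raw low-level actions.
# Layer sizes and dimensions below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionVAE(nn.Module):
    def __init__(self, action_dim=7, chunk_len=16, latent_dim=32, hidden_dim=256):
        super().__init__()
        in_dim = action_dim * chunk_len
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
        )
        self.action_dim, self.chunk_len = action_dim, chunk_len

    def encode(self, actions):
        # actions: (batch, chunk_len, action_dim) -> mean/log-variance of the latent
        h = self.encoder(actions.flatten(1))
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z):
        # latent embedding -> reconstructed action chunk
        return self.decoder(z).view(-1, self.chunk_len, self.action_dim)

    def forward(self, actions):
        mu, logvar = self.encode(actions)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        recon = self.decode(z)
        # Standard VAE objective: reconstruction term + KL regularizer.
        recon_loss = F.mse_loss(recon, actions)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, mu, recon_loss, kl


if __name__ == "__main__":
    vae = ActionVAE()
    actions = torch.randn(4, 16, 7)                # a batch of 4 action chunks
    recon, mu, recon_loss, kl = vae(actions)       # mu is the compact embedding
    print(recon.shape, mu.shape, recon_loss.item(), kl.item())
```

In this reading, the downstream VLA model would regress the compact latent (here `mu`) rather than a long raw action sequence, and the ActionVAE decoder would map that prediction back to executable actions, which is consistent with the abstract's claim of reducing the complexity of the VLA output space.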