RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
September 18, 2025
Authors: Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
cs.AI
Abstract
This paper presents RynnVLA-001, a vision-language-action (VLA) model built
upon large-scale video generative pretraining from human demonstrations. We
propose a novel two-stage pretraining methodology. The first stage, Ego-Centric
Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric
manipulation videos to predict future frames conditioned on an initial frame
and a language instruction. The second stage, Human-Centric Trajectory-Aware
Modeling, extends this by jointly predicting future keypoint trajectories,
thereby effectively bridging visual frame prediction with action prediction.
Furthermore, to enhance action representation, we propose ActionVAE, a
variational autoencoder that compresses sequences of actions into compact
latent embeddings, reducing the complexity of the VLA output space. When
finetuned on the same downstream robotics datasets, RynnVLA-001 achieves
superior performance over state-of-the-art baselines, demonstrating that the
proposed pretraining strategy provides a more effective initialization for VLA
models.
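To make the role of ActionVAE more concrete, the sketch below shows a generic variational autoencoder over fixed-length action chunks. This is a minimal illustration, not the paper's implementation: the class name aside, all layer sizes, the chunk length, the 7-dimensional action space, and the KL weight are assumptions. The idea it demonstrates is that the VLA model only needs to predict a compact latent embedding, which the decoder expands back into an executable action sequence, shrinking the VLA output space.

```python
# Minimal ActionVAE-style sketch (illustrative assumptions throughout).
import torch
import torch.nn as nn

class ActionVAE(nn.Module):
    def __init__(self, action_dim=7, chunk_len=16, latent_dim=32, hidden=256):
        super().__init__()
        in_dim = action_dim * chunk_len
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def encode(self, actions):
        # actions: (B, chunk_len, action_dim) -> latent mean/log-variance
        h = self.encoder(actions.flatten(1))
        return self.to_mu(h), self.to_logvar(h)

    def reparameterize(self, mu, logvar):
        # Sample z = mu + sigma * eps so gradients flow through the sampling step.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def decode(self, z):
        # Compact latent (B, latent_dim) -> reconstructed action chunk
        return self.decoder(z).view(-1, self.chunk_len, self.action_dim)

    def forward(self, actions):
        mu, logvar = self.encode(actions)
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        # Standard VAE objective: reconstruction error plus KL regularization
        # (the 1e-3 weight is an arbitrary placeholder).
        recon_loss = nn.functional.mse_loss(recon, actions)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, recon_loss + 1e-3 * kl
```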