
RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

September 18, 2025
作者: Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
cs.AI

Abstract

This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
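To make the ActionVAE idea concrete, below is a minimal, hypothetical sketch of a variational autoencoder that compresses a chunk of low-level actions into a single latent embedding, which a VLA policy could then predict instead of raw action sequences. The layer sizes, chunk length, and loss weighting are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical ActionVAE sketch (PyTorch): compress an action chunk into one
# compact latent vector and reconstruct it. All dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionVAE(nn.Module):
    def __init__(self, action_dim=7, chunk_len=16, latent_dim=32, hidden=256):
        super().__init__()
        flat = action_dim * chunk_len
        self.encoder = nn.Sequential(
            nn.Linear(flat, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, flat),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def encode(self, actions):
        # actions: (B, chunk_len, action_dim) -> latent mean and log-variance
        h = self.encoder(actions.flatten(1))
        return self.to_mu(h), self.to_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):
        # z: (B, latent_dim) -> reconstructed action chunk
        return self.decoder(z).view(-1, self.chunk_len, self.action_dim)

    def forward(self, actions, beta=1e-3):
        mu, logvar = self.encode(actions)
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        recon_loss = F.mse_loss(recon, actions)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, z, recon_loss + beta * kl
```

Under this sketch, the downstream VLA head would regress the compact latent z for each action chunk and the frozen decoder would map it back to executable actions, which is one plausible way the reduced output space described in the abstract could be used.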