RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
September 18, 2025
Authors: Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
cs.AI
Abstract
This paper presents RynnVLA-001, a vision-language-action (VLA) model built
upon large-scale video generative pretraining from human demonstrations. We
propose a novel two-stage pretraining methodology. The first stage, Ego-Centric
Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric
manipulation videos to predict future frames conditioned on an initial frame
and a language instruction. The second stage, Human-Centric Trajectory-Aware
Modeling, extends this by jointly predicting future keypoint trajectories,
thereby effectively bridging visual frame prediction with action prediction.
Furthermore, to enhance action representation, we propose ActionVAE, a
variational autoencoder that compresses sequences of actions into compact
latent embeddings, reducing the complexity of the VLA output space. When
finetuned on the same downstream robotics datasets, RynnVLA-001 achieves
superior performance over state-of-the-art baselines, demonstrating that the
proposed pretraining strategy provides a more effective initialization for VLA
models.
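To make the role of ActionVAE more concrete, the sketch below shows a generic variational autoencoder over fixed-length action chunks. This is a minimal illustration, not the paper's implementation: the class name aside, all layer sizes, the chunk length, the 7-dimensional action space, and the KL weight are assumptions. The idea it demonstrates is that the VLA model only needs to predict a compact latent embedding, which the decoder expands back into an executable action sequence, shrinking the VLA output space.

```python
# Minimal ActionVAE-style sketch (illustrative assumptions throughout).
import torch
import torch.nn as nn

class ActionVAE(nn.Module):
    def __init__(self, action_dim=7, chunk_len=16, latent_dim=32, hidden=256):
        super().__init__()
        in_dim = action_dim * chunk_len
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def encode(self, actions):
        # actions: (B, chunk_len, action_dim) -> latent mean/log-variance
        h = self.encoder(actions.flatten(1))
        return self.to_mu(h), self.to_logvar(h)

    def reparameterize(self, mu, logvar):
        # Sample z = mu + sigma * eps so gradients flow through the sampling step.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def decode(self, z):
        # Compact latent (B, latent_dim) -> reconstructed action chunk
        return self.decoder(z).view(-1, self.chunk_len, self.action_dim)

    def forward(self, actions):
        mu, logvar = self.encode(actions)
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        # Standard VAE objective: reconstruction error plus KL regularization
        # (the 1e-3 weight is an arbitrary placeholder).
        recon_loss = nn.functional.mse_loss(recon, actions)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, recon_loss + 1e-3 * kl
```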