RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

Abstract

This paper presents RynnVLA-001, a vision-language-action (VLA) model built on large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an image-to-video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby bridging visual frame prediction and action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When fine-tuned on the same downstream robotics datasets, RynnVLA-001 outperforms state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
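To make the ActionVAE idea concrete, the following is a minimal sketch of a variational autoencoder that compresses a chunk of actions into a single latent vector. It is not the architecture used by RynnVLA-001: the class name `TinyActionVAE`, the single linear encoder and decoder layers, the chunk length (8), action dimension (7), latent size (4), and the `beta` weight are all assumptions chosen for illustration, and the weights are random rather than trained.

```python
import math
import random


def init_linear(rng, n_in, n_out, scale=0.1):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    """Apply a dense layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class TinyActionVAE:
    """Hypothetical linear VAE over a flattened action chunk (illustration only)."""

    def __init__(self, chunk_len, action_dim, latent_dim, seed=0):
        self.rng = random.Random(seed)
        self.chunk_len = chunk_len
        self.action_dim = action_dim
        flat = chunk_len * action_dim
        self.enc_mu = init_linear(self.rng, flat, latent_dim)
        self.enc_logvar = init_linear(self.rng, flat, latent_dim)
        self.dec = init_linear(self.rng, latent_dim, flat)

    def encode(self, chunk):
        """Map a chunk (list of action vectors) to the posterior mean and log-variance."""
        flat = [v for step in chunk for v in step]
        return linear(flat, *self.enc_mu), linear(flat, *self.enc_logvar)

    def reparameterize(self, mu, logvar):
        """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]

    def decode(self, z):
        """Map a latent vector back to a chunk of shape (chunk_len, action_dim)."""
        flat = linear(z, *self.dec)
        d = self.action_dim
        return [flat[i * d:(i + 1) * d] for i in range(self.chunk_len)]

    def loss(self, chunk, beta=1.0):
        """Return (reconstruction MSE + beta * KL to a unit Gaussian, sampled latent)."""
        mu, logvar = self.encode(chunk)
        z = self.reparameterize(mu, logvar)
        recon = self.decode(z)
        n = self.chunk_len * self.action_dim
        mse = sum((a - b) ** 2 for s, r in zip(chunk, recon) for a, b in zip(s, r)) / n
        kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))
        return mse + beta * kl, z


# Toy action chunk: 8 timesteps of a 7-dimensional action (synthetic values).
vae = TinyActionVAE(chunk_len=8, action_dim=7, latent_dim=4)
chunk = [[math.sin(0.1 * t + j) for j in range(7)] for t in range(8)]
total_loss, z = vae.loss(chunk)
reconstruction = vae.decode(z)
```

In this sketch the 56 action values of one chunk are summarized by a 4-dimensional latent, which illustrates the sense in which a policy that predicts the latent instead of the raw chunk has a smaller output space.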
