RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

Abstract

This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an image-to-video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby bridging visual frame prediction and action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When fine-tuned on the same downstream robotics datasets, RynnVLA-001 outperforms state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
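The idea of compressing a sequence of actions into a compact latent embedding can be illustrated with a small sketch. This is not the paper's ActionVAE: it is an untrained, purely linear toy written with the Python standard library, and the class name `ToyActionVAE`, the helper names, and the sizes (a chunk of 8 actions, 7 values per action, a 4-dimensional latent) are all hypothetical choices made for illustration. It only shows the generic variational autoencoder pattern of encoding to a mean and log-variance, sampling with the reparameterization trick, decoding back to an action chunk, and computing the KL term.

```python
import math
import random


def init_layer(rng, n_out, n_in, scale=0.1):
    """Random weights and zero biases for one affine layer (untrained)."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Affine map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class ToyActionVAE:
    """Hypothetical linear VAE over a flattened action chunk (illustration only)."""

    def __init__(self, chunk_len, action_dim, latent_dim, seed=0):
        self.rng = random.Random(seed)
        self.action_dim = action_dim
        input_dim = chunk_len * action_dim
        self.enc_mu = init_layer(self.rng, latent_dim, input_dim)
        self.enc_logvar = init_layer(self.rng, latent_dim, input_dim)
        self.dec = init_layer(self.rng, input_dim, latent_dim)

    def encode(self, chunk):
        # Flatten the (chunk_len x action_dim) chunk into one vector.
        flat = [v for action in chunk for v in action]
        return linear(*self.enc_mu, flat), linear(*self.enc_logvar, flat)

    def sample(self, mu, logvar):
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]

    def decode(self, z):
        # Map the latent back to a flat vector, then regroup into actions.
        flat = linear(*self.dec, z)
        return [flat[i:i + self.action_dim]
                for i in range(0, len(flat), self.action_dim)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


# Assumed sizes: 8 actions per chunk, 7 values per action, 4 latent dimensions.
vae = ToyActionVAE(chunk_len=8, action_dim=7, latent_dim=4)
chunk = [[0.01 * t * (d + 1) for d in range(7)] for t in range(8)]
mu, logvar = vae.encode(chunk)
z = vae.sample(mu, logvar)
recon = vae.decode(z)
kl = kl_to_standard_normal(mu, logvar)
```

In this toy setting a downstream model would only need to predict the 4 latent values instead of all 56 action values in the chunk, which is the sense in which a latent action embedding shrinks the output space. A real model would use trained nonlinear encoders and decoders and a reconstruction loss alongside the KL term.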
