SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Abstract

Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of the large-scale human-operated robot trajectories required to scale SFT, and (ii) limited generalization to tasks that involve distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) show that reinforcement learning (RL) can dramatically improve step-by-step reasoning, which raises a natural question: can RL similarly improve the long-horizon, step-by-step action planning of VLA models? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored to VLA models. Building on veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. Applied to OpenVLA-OFT, SimpleVLA-RL achieves state-of-the-art (SoTA) performance on LIBERO and, with the exploration-enhancing strategies we introduce, even outperforms π₀ on RoboTwin 1.0 and 2.0. SimpleVLA-RL reduces dependence on large-scale data, enables robust generalization, and markedly surpasses SFT on real-world tasks. Moreover, we identify a novel phenomenon during RL training, which we call "pushcut", wherein the policy discovers patterns beyond those seen in the previous training process. GitHub: https://github.com/PRIME-RL/SimpleVLA-RL
