RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

Abstract

Recent progress in vision and language foundation models has significantly advanced multimodal understanding, reasoning, and generation, inspiring a surge of interest in extending such capabilities to embodied settings through vision-language-action (VLA) models. Yet most VLA models are still trained with supervised fine-tuning (SFT), which struggles to generalize under distribution shifts because of error accumulation. Reinforcement learning (RL) offers a promising alternative by directly optimizing task performance through interaction, but existing attempts remain fragmented and lack a unified platform for fair and systematic comparison across model architectures and algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. The system adopts a highly flexible resource allocation design that addresses the challenge of integrating rendering, training, and inference in RL+VLA training. In particular, for GPU-parallelized simulators, RLinf-VLA implements a novel hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup in training. Through a unified interface, RLinf-VLA seamlessly supports diverse VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g., PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a unified model achieves a 98.11% success rate across 130 LIBERO tasks and 97.66% across 25 ManiSkill tasks. Beyond empirical performance, our study distills a set of best practices for applying RL to VLA training and sheds light on emerging patterns in this integration. Furthermore, we present a preliminary deployment on a real-world Franka robot, where RL-trained policies exhibit stronger generalization than those trained with SFT. We envision RLinf-VLA as a foundation to accelerate and standardize research on embodied intelligence.
