

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

October 8, 2025
Authors: Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang
cs.AI

Abstract

Recent progress in vision and language foundation models has significantly advanced multimodal understanding, reasoning, and generation, inspiring a surge of interest in extending such capabilities to embodied settings through vision-language-action (VLA) models. Yet, most VLA models are still trained with supervised fine-tuning (SFT), which struggles to generalize under distribution shifts due to error accumulation. Reinforcement learning (RL) offers a promising alternative by directly optimizing task performance through interaction, but existing attempts remain fragmented and lack a unified platform for fair and systematic comparison across model architectures and algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. The system adopts a highly flexible resource allocation design that addresses the challenge of integrating rendering, training, and inference in RL+VLA training. In particular, for GPU-parallelized simulators, RLinf-VLA implements a novel hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup in training. Through a unified interface, RLinf-VLA seamlessly supports diverse VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g., PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a unified model achieves a 98.11% success rate across 130 LIBERO tasks and 97.66% across 25 ManiSkill tasks. Beyond empirical performance, our study distills a set of best practices for applying RL to VLA training and sheds light on emerging patterns in this integration. Furthermore, we present preliminary deployment on a real-world Franka robot, where RL-trained policies exhibit stronger generalization than those trained with SFT. We envision RLinf-VLA as a foundation to accelerate and standardize research on embodied intelligence.
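
The abstract does not expose RLinf-VLA's concrete API, but its description of a unified interface over VLA architectures, RL algorithms, simulators, and resource-allocation modes suggests a configuration-driven design. The Python sketch below is a hypothetical illustration of that idea; every class, field, and mode name here is an assumption for exposition, not part of the released framework.

```python
# Hypothetical sketch (not the actual RLinf-VLA API): shows how a unified,
# configuration-driven interface might pair a VLA architecture, an RL
# algorithm, a simulator, and a GPU-allocation mode in one place.
from dataclasses import dataclass

# Component choices mirror those named in the abstract; the allocation-mode
# names and config fields below are illustrative assumptions.
VLA_MODELS = {"openvla", "openvla-oft"}
RL_ALGOS = {"ppo", "grpo"}
SIMULATORS = {"maniskill", "libero"}
ALLOCATION_MODES = {"colocated", "disaggregated", "hybrid-pipeline"}


@dataclass
class TrainConfig:
    model: str = "openvla-oft"
    algo: str = "ppo"
    simulator: str = "maniskill"
    allocation: str = "hybrid-pipeline"  # fine-grained pipelining of rollout and training
    num_gpus: int = 8

    def validate(self) -> None:
        # Reject combinations the (hypothetical) framework does not know about.
        assert self.model in VLA_MODELS, f"unknown VLA model: {self.model}"
        assert self.algo in RL_ALGOS, f"unknown RL algorithm: {self.algo}"
        assert self.simulator in SIMULATORS, f"unknown simulator: {self.simulator}"
        assert self.allocation in ALLOCATION_MODES, f"unknown allocation mode: {self.allocation}"


def launch(cfg: TrainConfig) -> None:
    """Stand-in for a launcher that would wire simulator rollouts, policy
    inference, and RL updates onto the requested GPUs."""
    cfg.validate()
    print(
        f"Training {cfg.model} with {cfg.algo.upper()} in {cfg.simulator} "
        f"using '{cfg.allocation}' allocation on {cfg.num_gpus} GPUs"
    )


if __name__ == "__main__":
    launch(TrainConfig(model="openvla", algo="grpo", simulator="libero"))
```

The appeal of such a design is that swapping the model, algorithm, simulator, or allocation strategy becomes a configuration change rather than a code rewrite, which is what makes fair, systematic comparison across combinations practical.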