

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

October 8, 2025
Authors: Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang
cs.AI

Abstract

Recent progress in vision and language foundation models has significantly advanced multimodal understanding, reasoning, and generation, inspiring a surge of interest in extending such capabilities to embodied settings through vision-language-action (VLA) models. Yet, most VLA models are still trained with supervised fine-tuning (SFT), which struggles to generalize under distribution shifts due to error accumulation. Reinforcement learning (RL) offers a promising alternative by directly optimizing task performance through interaction, but existing attempts remain fragmented and lack a unified platform for fair and systematic comparison across model architectures and algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. The system adopts a highly flexible resource allocation design that addresses the challenge of integrating rendering, training, and inference in RL+VLA training. In particular, for GPU-parallelized simulators, RLinf-VLA implements a novel hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup in training. Through a unified interface, RLinf-VLA seamlessly supports diverse VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g., PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a unified model achieves a 98.11% success rate across 130 LIBERO tasks and 97.66% across 25 ManiSkill tasks. Beyond empirical performance, our study distills a set of best practices for applying RL to VLA training and sheds light on emerging patterns in this integration. Furthermore, we present preliminary deployment on a real-world Franka robot, where RL-trained policies exhibit stronger generalization than those trained with SFT. We envision RLinf-VLA as a foundation to accelerate and standardize research on embodied intelligence.
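
The abstract does not expose RLinf-VLA's concrete API, but its description of a unified interface over VLA architectures, RL algorithms, simulators, and resource-allocation modes suggests a configuration-driven design. The Python sketch below is a hypothetical illustration of that idea; every class, field, and mode name here is an assumption for exposition, not part of the released framework.

```python
# Hypothetical sketch (not the actual RLinf-VLA API): shows how a unified,
# configuration-driven interface might pair a VLA architecture, an RL
# algorithm, a simulator, and a GPU-allocation mode in one place.
from dataclasses import dataclass

# Component choices mirror those named in the abstract; the allocation-mode
# names and config fields below are illustrative assumptions.
VLA_MODELS = {"openvla", "openvla-oft"}
RL_ALGOS = {"ppo", "grpo"}
SIMULATORS = {"maniskill", "libero"}
ALLOCATION_MODES = {"colocated", "disaggregated", "hybrid-pipeline"}


@dataclass
class TrainConfig:
    model: str = "openvla-oft"
    algo: str = "ppo"
    simulator: str = "maniskill"
    allocation: str = "hybrid-pipeline"  # fine-grained pipelining of rollout and training
    num_gpus: int = 8

    def validate(self) -> None:
        # Reject combinations the (hypothetical) framework does not know about.
        assert self.model in VLA_MODELS, f"unknown VLA model: {self.model}"
        assert self.algo in RL_ALGOS, f"unknown RL algorithm: {self.algo}"
        assert self.simulator in SIMULATORS, f"unknown simulator: {self.simulator}"
        assert self.allocation in ALLOCATION_MODES, f"unknown allocation mode: {self.allocation}"


def launch(cfg: TrainConfig) -> None:
    """Stand-in for a launcher that would wire simulator rollouts, policy
    inference, and RL updates onto the requested GPUs."""
    cfg.validate()
    print(
        f"Training {cfg.model} with {cfg.algo.upper()} in {cfg.simulator} "
        f"using '{cfg.allocation}' allocation on {cfg.num_gpus} GPUs"
    )


if __name__ == "__main__":
    launch(TrainConfig(model="openvla", algo="grpo", simulator="libero"))
```

The appeal of such a design is that swapping the model, algorithm, simulator, or allocation strategy becomes a configuration change rather than a code rewrite, which is what makes fair, systematic comparison across combinations practical.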