RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
October 8, 2025
Authors: Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang
cs.AI
Abstract
Recent progress in vision and language foundation models has significantly
advanced multimodal understanding, reasoning, and generation, inspiring a surge
of interest in extending such capabilities to embodied settings through
vision-language-action (VLA) models. Yet, most VLA models are still trained
with supervised fine-tuning (SFT), which struggles to generalize under
distribution shifts due to error accumulation. Reinforcement learning (RL)
offers a promising alternative by directly optimizing task performance through
interaction, but existing attempts remain fragmented and lack a unified
platform for fair and systematic comparison across model architectures and
algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and
efficient framework for scalable RL training of VLA models. The system adopts a
highly flexible resource allocation design that addresses the challenge of
integrating rendering, training, and inference in RL+VLA training. In
particular, for GPU-parallelized simulators, RLinf-VLA implements a novel
hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup
in training. Through a unified interface, RLinf-VLA seamlessly supports diverse
VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g.,
PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a
unified model achieves a 98.11% success rate across 130 LIBERO tasks and a
97.66% success rate across 25
ManiSkill tasks. Beyond empirical performance, our study distills a set of best
practices for applying RL to VLA training and sheds light on emerging patterns
in this integration. Furthermore, we present preliminary deployment on a
real-world Franka robot, where RL-trained policies exhibit stronger
generalization than those trained with SFT. We envision RLinf-VLA as a
foundation to accelerate and standardize research on embodied intelligence.
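
The abstract attributes the 1.61x-1.88x training speedup to a hybrid fine-grained pipeline allocation that overlaps simulator stepping with policy inference rather than alternating them. The toy sketch below is only an illustration of why pipelining environment chunks shortens wall-clock rollout time under that assumption: plain Python threads and sleep calls stand in for GPU work, and none of the names come from RLinf-VLA.

```python
"""Toy illustration of pipelined rollout collection (hypothetical; this is
NOT the RLinf-VLA scheduler). Environments are split into chunks so that
simulator stepping of one chunk overlaps with policy inference on the next,
instead of running the two stages back to back."""
import queue
import threading
import time

SIM_STEP = 0.02    # pretend cost of stepping one env chunk in the simulator
INFER_STEP = 0.02  # pretend cost of one policy-inference batch
CHUNKS = 8         # env chunks per rollout step
STEPS = 10         # rollout steps to collect


def sequential() -> float:
    """Alternate inference and simulation; neither stage overlaps the other."""
    start = time.perf_counter()
    for _ in range(STEPS):
        for _ in range(CHUNKS):
            time.sleep(INFER_STEP)  # policy picks actions for the chunk
            time.sleep(SIM_STEP)    # simulator steps the chunk
    return time.perf_counter() - start


def pipelined() -> float:
    """Inference thread feeds action batches to a simulator thread via a queue."""
    actions = queue.Queue(maxsize=2)  # small buffer of ready action batches

    def inference() -> None:
        for _ in range(STEPS * CHUNKS):
            time.sleep(INFER_STEP)
            actions.put(1)
        actions.put(None)  # sentinel: rollout finished

    def simulate() -> None:
        while actions.get() is not None:
            time.sleep(SIM_STEP)

    start = time.perf_counter()
    producer = threading.Thread(target=inference)
    consumer = threading.Thread(target=simulate)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # With equal stage costs, overlapping the stages roughly halves rollout time.
    print(f"sequential: {sequential():.2f}s  pipelined: {pipelined():.2f}s")
```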
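
The "unified interface" described in the abstract can be pictured as a single configuration surface over VLA architectures, RL algorithms, simulators, and GPU resource-allocation modes. The minimal sketch below uses hypothetical names (TrainConfig, AllocationMode, describe); it illustrates the idea under those assumptions and is not the actual RLinf-VLA API.

```python
"""Minimal sketch of a unified VLA+RL training configuration.

All class and function names here are hypothetical illustrations of the
ideas in the abstract, not the actual RLinf-VLA interface."""
from dataclasses import dataclass
from enum import Enum


class AllocationMode(Enum):
    """How rendering (simulation), inference, and training share GPUs."""
    COLOCATED = "colocated"              # all stages share every GPU
    DISAGGREGATED = "disaggregated"      # each stage owns its own GPU group
    HYBRID_PIPELINE = "hybrid_pipeline"  # fine-grained pipelining of stages


@dataclass
class TrainConfig:
    model: str = "openvla-oft"    # e.g. "openvla" or "openvla-oft"
    algorithm: str = "ppo"        # e.g. "ppo" or "grpo"
    simulator: str = "maniskill"  # e.g. "maniskill" or "libero"
    num_envs: int = 256           # GPU-parallelized simulator environments
    allocation: AllocationMode = AllocationMode.HYBRID_PIPELINE


def describe(cfg: TrainConfig) -> str:
    """Summarize the run plan implied by a configuration."""
    return (
        f"Train {cfg.model} with {cfg.algorithm.upper()} on {cfg.simulator} "
        f"({cfg.num_envs} parallel envs, {cfg.allocation.value} allocation)"
    )


if __name__ == "__main__":
    # Swapping the architecture, algorithm, or simulator only changes the
    # config object, which is the point of a unified interface.
    print(describe(TrainConfig()))
    print(describe(TrainConfig(model="openvla", algorithm="grpo",
                               simulator="libero",
                               allocation=AllocationMode.COLOCATED)))
```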