RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
October 8, 2025
Authors: Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang
cs.AI
Abstract
Recent progress in vision and language foundation models has significantly
advanced multimodal understanding, reasoning, and generation, inspiring a surge
of interest in extending such capabilities to embodied settings through
vision-language-action (VLA) models. Yet, most VLA models are still trained
with supervised fine-tuning (SFT), which struggles to generalize under
distribution shifts due to error accumulation. Reinforcement learning (RL)
offers a promising alternative by directly optimizing task performance through
interaction, but existing attempts remain fragmented and lack a unified
platform for fair and systematic comparison across model architectures and
algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and
efficient framework for scalable RL training of VLA models. The system adopts a
highly flexible resource allocation design that addresses the challenge of
integrating rendering, training, and inference in RL+VLA training. In
particular, for GPU-parallelized simulators, RLinf-VLA implements a novel
hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup
in training. Through a unified interface, RLinf-VLA seamlessly supports diverse
VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g.,
PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a
unified model achieves a 98.11% success rate across 130 LIBERO tasks and a
97.66% success rate across 25
ManiSkill tasks. Beyond empirical performance, our study distills a set of best
practices for applying RL to VLA training and sheds light on emerging patterns
in this integration. Furthermore, we present preliminary deployment on a
real-world Franka robot, where RL-trained policies exhibit stronger
generalization than those trained with SFT. We envision RLinf-VLA as a
foundation to accelerate and standardize research on embodied intelligence.
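
The abstract attributes the 1.61x-1.88x training speedup to a hybrid fine-grained pipeline allocation that overlaps simulator stepping with policy inference rather than alternating them. The toy sketch below is only an illustration of why pipelining environment chunks shortens wall-clock rollout time under that assumption: plain Python threads and sleep calls stand in for GPU work, and none of the names come from RLinf-VLA.

```python
"""Toy illustration of pipelined rollout collection (hypothetical; this is
NOT the RLinf-VLA scheduler). Environments are split into chunks so that
simulator stepping of one chunk overlaps with policy inference on the next,
instead of running the two stages back to back."""
import queue
import threading
import time

SIM_STEP = 0.02    # pretend cost of stepping one env chunk in the simulator
INFER_STEP = 0.02  # pretend cost of one policy-inference batch
CHUNKS = 8         # env chunks per rollout step
STEPS = 10         # rollout steps to collect


def sequential() -> float:
    """Alternate inference and simulation; neither stage overlaps the other."""
    start = time.perf_counter()
    for _ in range(STEPS):
        for _ in range(CHUNKS):
            time.sleep(INFER_STEP)  # policy picks actions for the chunk
            time.sleep(SIM_STEP)    # simulator steps the chunk
    return time.perf_counter() - start


def pipelined() -> float:
    """Inference thread feeds action batches to a simulator thread via a queue."""
    actions = queue.Queue(maxsize=2)  # small buffer of ready action batches

    def inference() -> None:
        for _ in range(STEPS * CHUNKS):
            time.sleep(INFER_STEP)
            actions.put(1)
        actions.put(None)  # sentinel: rollout finished

    def simulate() -> None:
        while actions.get() is not None:
            time.sleep(SIM_STEP)

    start = time.perf_counter()
    producer = threading.Thread(target=inference)
    consumer = threading.Thread(target=simulate)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # With equal stage costs, overlapping the stages roughly halves rollout time.
    print(f"sequential: {sequential():.2f}s  pipelined: {pipelined():.2f}s")
```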
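
The "unified interface" described in the abstract can be pictured as a single configuration surface over VLA architectures, RL algorithms, simulators, and GPU resource-allocation modes. The minimal sketch below uses hypothetical names (TrainConfig, AllocationMode, describe); it illustrates the idea under those assumptions and is not the actual RLinf-VLA API.

```python
"""Minimal sketch of a unified VLA+RL training configuration.

All class and function names here are hypothetical illustrations of the
ideas in the abstract, not the actual RLinf-VLA interface."""
from dataclasses import dataclass
from enum import Enum


class AllocationMode(Enum):
    """How rendering (simulation), inference, and training share GPUs."""
    COLOCATED = "colocated"              # all stages share every GPU
    DISAGGREGATED = "disaggregated"      # each stage owns its own GPU group
    HYBRID_PIPELINE = "hybrid_pipeline"  # fine-grained pipelining of stages


@dataclass
class TrainConfig:
    model: str = "openvla-oft"    # e.g. "openvla" or "openvla-oft"
    algorithm: str = "ppo"        # e.g. "ppo" or "grpo"
    simulator: str = "maniskill"  # e.g. "maniskill" or "libero"
    num_envs: int = 256           # GPU-parallelized simulator environments
    allocation: AllocationMode = AllocationMode.HYBRID_PIPELINE


def describe(cfg: TrainConfig) -> str:
    """Summarize the run plan implied by a configuration."""
    return (
        f"Train {cfg.model} with {cfg.algorithm.upper()} on {cfg.simulator} "
        f"({cfg.num_envs} parallel envs, {cfg.allocation.value} allocation)"
    )


if __name__ == "__main__":
    # Swapping the architecture, algorithm, or simulator only changes the
    # config object, which is the point of a unified interface.
    print(describe(TrainConfig()))
    print(describe(TrainConfig(model="openvla", algorithm="grpo",
                               simulator="libero",
                               allocation=AllocationMode.COLOCATED)))
```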