RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

Abstract

Recent progress in vision and language foundation models has significantly advanced multimodal understanding, reasoning, and generation, inspiring a surge of interest in extending these capabilities to embodied settings through vision-language-action (VLA) models. Yet most VLA models are still trained with supervised fine-tuning (SFT), which struggles to generalize under distribution shifts because of error accumulation. Reinforcement learning (RL) offers a promising alternative by directly optimizing task performance through interaction, but existing attempts remain fragmented and lack a unified platform for fair and systematic comparison across model architectures and algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. The system adopts a highly flexible resource allocation design that addresses the challenge of integrating rendering, training, and inference in RL+VLA training. In particular, for GPU-parallelized simulators, RLinf-VLA implements a novel hybrid fine-grained pipeline allocation mode, achieving a 1.61x to 1.88x speedup in training. Through a unified interface, RLinf-VLA seamlessly supports diverse VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g., PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a unified model achieves 98.11% across 130 LIBERO tasks and 97.66% across 25 ManiSkill tasks. Beyond empirical performance, our study distills a set of best practices for applying RL to VLA training and sheds light on emerging patterns in this integration. Furthermore, we present a preliminary deployment on a real-world Franka robot, where RL-trained policies exhibit stronger generalization than those trained with SFT. We envision RLinf-VLA as a foundation to accelerate and standardize research on embodied intelligence.
