SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
September 11, 2025
Authors: Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding
cs.AI
Abstract
Vision-Language-Action (VLA) models have recently emerged as a powerful
paradigm for robotic manipulation. Despite substantial progress enabled by
large-scale pretraining and supervised fine-tuning (SFT), these models face two
fundamental challenges: (i) the scarcity and high cost of large-scale
human-operated robotic trajectories required for SFT scaling, and (ii) limited
generalization to tasks involving distribution shift. Recent breakthroughs in
Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can
dramatically enhance step-by-step reasoning capabilities, raising a natural
question: Can RL similarly improve the long-horizon step-by-step action
planning of VLA models? In this work, we introduce SimpleVLA-RL, an efficient RL
framework tailored for VLA models. Building upon veRL, we introduce
VLA-specific trajectory sampling, scalable parallelization, multi-environment
rendering, and optimized loss computation. When applied to OpenVLA-OFT,
SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms pi_0
on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce.
SimpleVLA-RL not only reduces dependence on large-scale data and enables robust
generalization, but also remarkably surpasses SFT in real-world tasks.
Moreover, we identify a novel phenomenon, "pushcut", during RL training,
wherein the policy discovers manipulation patterns that never appeared at any
earlier stage of training. GitHub: https://github.com/PRIME-RL/SimpleVLA-RL
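
The abstract describes an RL loop built on veRL with VLA-specific trajectory sampling, parallel multi-environment rollouts, and an optimized loss. As a rough illustration of that loop's shape only, here is a minimal Python sketch. Every name (ManipulationEnv, VLAPolicy, rollout, train_step), the binary task-success reward, and the REINFORCE-style update with a group-mean baseline are hypothetical simplifications for this sketch, not the actual SimpleVLA-RL or veRL APIs.

# Minimal, illustrative sketch of outcome-reward RL for a VLA-style policy.
# All classes and functions here are toy stand-ins, not SimpleVLA-RL / veRL code.
import numpy as np

class ManipulationEnv:
    """Toy stand-in for a simulated manipulation task (e.g., a LIBERO-like benchmark)."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.t = 0
    def reset(self):
        self.t = 0
        return self.rng.normal(size=8)            # fake observation vector
    def step(self, action):
        self.t += 1
        obs = self.rng.normal(size=8)
        done = self.t >= 10
        success = done and (np.linalg.norm(action) < 1.0)  # fake success criterion
        return obs, done, success

class VLAPolicy:
    """Toy linear 'policy' standing in for a VLA model's action head."""
    def __init__(self, obs_dim=8, act_dim=4):
        self.W = np.zeros((act_dim, obs_dim))
    def act(self, obs):
        mean = self.W @ obs
        action = mean + np.random.normal(scale=0.5, size=mean.shape)  # exploration noise
        return action, action - mean              # return action and its noise component
    def update(self, grads, lr=1e-2):
        self.W += lr * grads

def rollout(policy, env):
    """Roll out one trajectory; return (obs, noise) pairs and a binary outcome reward."""
    obs, done, success, steps = env.reset(), False, False, []
    while not done:
        action, noise = policy.act(obs)
        steps.append((obs, noise))
        obs, done, success = env.step(action)
    return steps, float(success)

def train_step(policy, n_envs=8):
    """Sample a group of trajectories across parallel environments, baseline the
    binary rewards with the group mean, and apply a crude policy-gradient update."""
    groups = [rollout(policy, ManipulationEnv(seed=i)) for i in range(n_envs)]
    rewards = np.array([r for _, r in groups])
    adv = rewards - rewards.mean()                # group-mean baseline as advantage
    grads = np.zeros_like(policy.W)
    for (steps, _), a in zip(groups, adv):
        for obs, noise in steps:
            grads += a * np.outer(noise, obs)     # REINFORCE-like gradient term
    policy.update(grads / max(len(groups), 1))
    return rewards.mean()

if __name__ == "__main__":
    policy = VLAPolicy()
    for it in range(5):
        print(f"iter {it}: success rate = {train_step(policy):.2f}")

The sketch only conveys the overall structure (parallel trajectory sampling, sparse outcome reward, advantage-weighted update); the paper's actual loss computation, parallelization, and rendering stack differ and are detailed in the repository linked above.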