SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
September 11, 2025
Authors: Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding
cs.AI
Abstract
Vision-Language-Action (VLA) models have recently emerged as a powerful
paradigm for robotic manipulation. Despite substantial progress enabled by
large-scale pretraining and supervised fine-tuning (SFT), these models face two
fundamental challenges: (i) the scarcity and high cost of large-scale
human-operated robotic trajectories required for SFT scaling, and (ii) limited
generalization to tasks involving distribution shift. Recent breakthroughs in
Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can
dramatically enhance step-by-step reasoning capabilities, raising a natural
question: Can RL similarly improve the long-horizon step-by-step action
planning of VLAs? In this work, we introduce SimpleVLA-RL, an efficient RL
framework tailored for VLA models. Building on veRL, we add VLA-specific
trajectory sampling, scalable parallelization, multi-environment rendering,
and optimized loss computation. When applied to OpenVLA-OFT,
SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms pi_0
on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce.
SimpleVLA-RL not only reduces dependence on large-scale data and enables robust
generalization, but also substantially outperforms SFT on real-world tasks.
Moreover, we identify a novel phenomenon, "pushcut", during RL training,
wherein the policy discovers action patterns never seen in the preceding
training stages. GitHub: https://github.com/PRIME-RL/SimpleVLA-RL
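To make the training recipe concrete, below is a minimal, self-contained sketch of the kind of outcome-reward policy-gradient loop the abstract describes: sample action trajectories from a VLA-style policy across several parallel environments, score each rollout with a sparse task-success reward, and update the policy with group-normalized advantages. This is an illustrative sketch only; all names here (ToyVLAPolicy, DummyEnv, rollout) are hypothetical stand-ins and do not reflect the actual SimpleVLA-RL or veRL API.

```python
# Hypothetical sketch of an outcome-reward RL loop for a VLA-style policy.
# Not the SimpleVLA-RL / veRL implementation; all names are illustrative.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Stand-in for a VLA policy that emits discretized action tokens."""
    def __init__(self, obs_dim=32, n_action_tokens=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_action_tokens))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class DummyEnv:
    """Placeholder for one simulation environment (e.g. a LIBERO/RoboTwin task)."""
    def __init__(self, obs_dim=32, horizon=8):
        self.obs_dim, self.horizon = obs_dim, horizon

    def reset(self):
        self.t = 0
        return torch.randn(self.obs_dim)

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        # Sparse outcome reward: 1.0 if the episode "succeeds", else 0.0.
        reward = float(torch.rand(()) < 0.3) if done else 0.0
        return torch.randn(self.obs_dim), reward, done

def rollout(policy, env):
    """Sample one trajectory; return its summed log-prob and outcome reward."""
    obs, logps, total_r, done = env.reset(), [], 0.0, False
    while not done:
        dist = policy(obs)
        action = dist.sample()
        logps.append(dist.log_prob(action))
        obs, r, done = env.step(action)
        total_r += r
    return torch.stack(logps).sum(), total_r

policy = ToyVLAPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
envs = [DummyEnv() for _ in range(8)]  # multiple parallel environments

# Sample a group of trajectories, then normalize rewards within the group
# to form advantages (a GRPO-style baseline), and apply a REINFORCE-style loss.
logps, rewards = zip(*[rollout(policy, e) for e in envs])
rewards = torch.tensor(rewards)
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
loss = -(torch.stack(logps) * adv).mean()

opt.zero_grad()
loss.backward()
opt.step()
```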