SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
September 11, 2025
Authors: Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding
cs.AI
Abstract
Vision-Language-Action (VLA) models have recently emerged as a powerful
paradigm for robotic manipulation. Despite substantial progress enabled by
large-scale pretraining and supervised fine-tuning (SFT), these models face two
fundamental challenges: (i) the scarcity and high cost of large-scale
human-operated robotic trajectories required for SFT scaling, and (ii) limited
generalization to tasks involving distribution shift. Recent breakthroughs in
Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can
dramatically enhance step-by-step reasoning capabilities, raising a natural
question: Can RL similarly improve the long-horizon step-by-step action
planning of VLAs? In this work, we introduce SimpleVLA-RL, an efficient RL
framework tailored for VLA models. Building on veRL, we add VLA-specific
trajectory sampling, scalable parallelization, multi-environment rendering,
and optimized loss computation. When applied to OpenVLA-OFT,
SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms pi_0
on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce.
SimpleVLA-RL not only reduces dependence on large-scale data and enables robust
generalization, but also substantially outperforms SFT on real-world tasks.
Moreover, we identify a novel phenomenon, "pushcut", during RL training,
wherein the policy discovers action patterns never seen in the preceding
training stages. GitHub: https://github.com/PRIME-RL/SimpleVLA-RL
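To make the training recipe concrete, below is a minimal, self-contained sketch of the kind of outcome-reward policy-gradient loop the abstract describes: sample action trajectories from a VLA-style policy across several parallel environments, score each rollout with a sparse task-success reward, and update the policy with group-normalized advantages. This is an illustrative sketch only; all names here (ToyVLAPolicy, DummyEnv, rollout) are hypothetical stand-ins and do not reflect the actual SimpleVLA-RL or veRL API.

```python
# Hypothetical sketch of an outcome-reward RL loop for a VLA-style policy.
# Not the SimpleVLA-RL / veRL implementation; all names are illustrative.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Stand-in for a VLA policy that emits discretized action tokens."""
    def __init__(self, obs_dim=32, n_action_tokens=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_action_tokens))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class DummyEnv:
    """Placeholder for one simulation environment (e.g. a LIBERO/RoboTwin task)."""
    def __init__(self, obs_dim=32, horizon=8):
        self.obs_dim, self.horizon = obs_dim, horizon

    def reset(self):
        self.t = 0
        return torch.randn(self.obs_dim)

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        # Sparse outcome reward: 1.0 if the episode "succeeds", else 0.0.
        reward = float(torch.rand(()) < 0.3) if done else 0.0
        return torch.randn(self.obs_dim), reward, done

def rollout(policy, env):
    """Sample one trajectory; return its summed log-prob and outcome reward."""
    obs, logps, total_r, done = env.reset(), [], 0.0, False
    while not done:
        dist = policy(obs)
        action = dist.sample()
        logps.append(dist.log_prob(action))
        obs, r, done = env.step(action)
        total_r += r
    return torch.stack(logps).sum(), total_r

policy = ToyVLAPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
envs = [DummyEnv() for _ in range(8)]  # multiple parallel environments

# Sample a group of trajectories, then normalize rewards within the group
# to form advantages (a GRPO-style baseline), and apply a REINFORCE-style loss.
logps, rewards = zip(*[rollout(policy, e) for e in envs])
rewards = torch.tensor(rewards)
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
loss = -(torch.stack(logps) * adv).mean()

opt.zero_grad()
loss.backward()
opt.step()
```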