Interactive Post-Training for Vision-Language-Action Models
May 22, 2025
Authors: Shuhan Tan, Kairan Dou, Yue Zhao, Philipp Krähenbühl
cs.AI
Abstract
We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based
interactive post-training paradigm that fine-tunes pretrained
Vision-Language-Action (VLA) models using only sparse binary success rewards.
Existing VLA training pipelines rely heavily on offline expert demonstration
data and supervised imitation, limiting their ability to adapt to new tasks and
environments under low-data regimes. RIPT-VLA addresses this by enabling
interactive post-training with a stable policy optimization algorithm based on
dynamic rollout sampling and leave-one-out advantage estimation.
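To make the optimization step concrete, below is a minimal sketch (in Python) of leave-one-out advantage estimation over sparse binary success rewards, paired with a simple dynamic rollout sampling loop that skips contexts whose rollouts all succeed or all fail (such contexts yield zero advantage). This is an illustration under stated assumptions, not the paper's implementation: the `policy.rollout` interface, the `success` attribute, and the parameters `k` and `max_tries` are hypothetical.

```python
import numpy as np

def leave_one_out_advantages(rewards):
    """Leave-one-out advantage for K rollouts of the same task context.

    Each rollout's baseline is the mean reward of the other K-1 rollouts,
    so a context where every rollout succeeds (or every one fails) gives
    zero advantage and contributes no policy gradient.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    k = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (k - 1)  # mean of the other rollouts
    return rewards - baseline

def sample_informative_rollouts(policy, context, k=8, max_tries=4):
    """Dynamic rollout sampling (sketch): keep resampling a context until its
    K binary success rewards are mixed; otherwise report it as uninformative."""
    for _ in range(max_tries):
        rollouts = [policy.rollout(context) for _ in range(k)]
        rewards = [float(r.success) for r in rollouts]  # sparse 0/1 reward
        if 0.0 < sum(rewards) < k:  # mixed outcomes -> nonzero advantages
            return rollouts, leave_one_out_advantages(rewards)
    return None, None  # all-success or all-failure context, skip it
```

The leave-one-out baseline uses only the current batch of rollouts, so no learned value function is needed, which is one reason this style of estimator is attractive when the only feedback is a binary success signal.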
RIPT-VLA has the following characteristics. First, it applies to various VLA
models, improving the lightweight QueST model by 21.2% and pushing the 7B
OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is
computationally and data efficient: with only one demonstration, RIPT-VLA
enables a previously non-functional SFT model (4% success rate) to reach a 97%
success rate within 15 iterations. Furthermore, we demonstrate that the policy learned
by RIPT-VLA generalizes across different tasks and scenarios and is robust to
the initial state context. These results highlight RIPT-VLA as a practical and
effective paradigm for post-training VLA models through minimal supervision.