Interactive Post-Training for Vision-Language-Action Models
May 22, 2025
Authors: Shuhan Tan, Kairan Dou, Yue Zhao, Philipp Krähenbühl
cs.AI
Abstract
We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based
interactive post-training paradigm that fine-tunes pretrained
Vision-Language-Action (VLA) models using only sparse binary success rewards.
Existing VLA training pipelines rely heavily on offline expert demonstration
data and supervised imitation, limiting their ability to adapt to new tasks and
environments under low-data regimes. RIPT-VLA addresses this by enabling
interactive post-training with a stable policy optimization algorithm based on
dynamic rollout sampling and leave-one-out advantage estimation.
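To make the optimization step concrete, below is a minimal sketch (in Python) of leave-one-out advantage estimation over sparse binary success rewards, paired with a simple dynamic rollout sampling loop that skips contexts whose rollouts all succeed or all fail (such contexts yield zero advantage). This is an illustration under stated assumptions, not the paper's implementation: the `policy.rollout` interface, the `success` attribute, and the parameters `k` and `max_tries` are hypothetical.

```python
import numpy as np

def leave_one_out_advantages(rewards):
    """Leave-one-out advantage for K rollouts of the same task context.

    Each rollout's baseline is the mean reward of the other K-1 rollouts,
    so a context where every rollout succeeds (or every one fails) gives
    zero advantage and contributes no policy gradient.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    k = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (k - 1)  # mean of the other rollouts
    return rewards - baseline

def sample_informative_rollouts(policy, context, k=8, max_tries=4):
    """Dynamic rollout sampling (sketch): keep resampling a context until its
    K binary success rewards are mixed; otherwise report it as uninformative."""
    for _ in range(max_tries):
        rollouts = [policy.rollout(context) for _ in range(k)]
        rewards = [float(r.success) for r in rollouts]  # sparse 0/1 reward
        if 0.0 < sum(rewards) < k:  # mixed outcomes -> nonzero advantages
            return rollouts, leave_one_out_advantages(rewards)
    return None, None  # all-success or all-failure context, skip it
```

The leave-one-out baseline uses only the current batch of rollouts, so no learned value function is needed, which is one reason this style of estimator is attractive when the only feedback is a binary success signal.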
RIPT-VLA has the following characteristics. First, it applies to various VLA
models, improving the lightweight QueST model by 21.2% and pushing the 7B
OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is
computationally and data efficient: with only one demonstration, RIPT-VLA
enables a previously non-functional SFT model (4% success rate) to reach a 97%
success rate within 15 iterations. Furthermore, we demonstrate that the policy learned
by RIPT-VLA generalizes across different tasks and scenarios and is robust to
the initial state context. These results highlight RIPT-VLA as a practical and
effective paradigm for post-training VLA models through minimal supervision.