
Interactive Post-Training for Vision-Language-Action Models

May 22, 2025
Authors: Shuhan Tan, Kairan Dou, Yue Zhao, Philipp Krähenbühl
cs.AI

Abstract

We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments in low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to a wide range of VLA models, improving the lightweight QueST model by 21.2% and raising the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally and data efficient: with only one demonstration, RIPT-VLA brings a previously non-functional SFT model (4% success rate) to a 97% success rate within 15 iterations. Furthermore, we demonstrate that the policy learned by RIPT-VLA generalizes across different tasks and scenarios and is robust to the initial state context. These results highlight RIPT-VLA as a practical and effective paradigm for post-training VLA models with minimal supervision.
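To make the optimization idea concrete, the sketch below illustrates leave-one-out advantage estimation over K rollouts with sparse binary success rewards, plus a simple dynamic rollout-sampling loop that redraws task contexts whose rollouts all succeed or all fail (where every leave-one-out advantage is zero). This is a minimal, hypothetical example of the general technique, not the paper's implementation; the names `leave_one_out_advantages`, `sample_informative_rollouts`, `contexts.sample()`, and `policy.rollout()` are placeholders introduced here for illustration.

```python
import numpy as np

def leave_one_out_advantages(rewards):
    """Leave-one-out advantages for K rollouts of one task context.

    Each rollout is baselined against the mean reward of the other K-1 rollouts,
    so with binary success rewards the advantages vanish when all rollouts agree.
    """
    rewards = np.asarray(rewards, dtype=float)
    k = len(rewards)
    baselines = (rewards.sum() - rewards) / (k - 1)  # mean reward of the other rollouts
    return rewards - baselines

def sample_informative_rollouts(policy, contexts, k=8, max_tries=100):
    """Dynamic rollout sampling (illustrative): redraw task contexts until the K
    binary outcomes are mixed, so the leave-one-out advantages are non-zero."""
    for _ in range(max_tries):
        ctx = contexts.sample()                             # hypothetical context sampler
        rewards = [policy.rollout(ctx) for _ in range(k)]   # each rollout returns 0 or 1
        if 0 < sum(rewards) < k:                            # at least one success and one failure
            return ctx, rewards, leave_one_out_advantages(rewards)
    return None  # no informative context found within the budget
```

Under these assumptions, the returned advantages would weight the policy-gradient update for each rollout, which is one common way sparse binary rewards are turned into a stable learning signal without a learned value function.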
