SOP: A Scalable Online Post-Training System for Vision-Language-Action Models
January 6, 2026
Authors: Mingjie Pan, Siyuan Feng, Qinglin Zhang, Xinchen Li, Jianheng Song, Chendi Qu, Yi Wang, Chuankang Li, Ziyu Xiong, Zhi Chen, Yi Liu, Jianlan Luo
cs.AI
Abstract
Vision-language-action (VLA) models achieve strong generalization through large-scale pre-training, but real-world deployment requires expert-level task proficiency in addition to broad generality. Existing post-training approaches for VLA models are typically offline, single-robot, or task-specific, limiting effective on-policy adaptation and scalable learning from real-world interaction. We introduce a Scalable Online Post-training (SOP) system that enables online, distributed, multi-task post-training of generalist VLA models directly in the physical world. SOP tightly couples execution and learning through a closed-loop architecture in which a fleet of robots continuously streams on-policy experience and human intervention signals to a centralized cloud learner, and asynchronously receives updated policies. This design supports prompt on-policy correction, scales experience collection through parallel deployment, and preserves generality during adaptation. SOP is agnostic to the choice of post-training algorithm; we instantiate it with both interactive imitation learning (HG-DAgger) and reinforcement learning (RECAP). Across a range of real-world manipulation tasks including cloth folding, box assembly, and grocery restocking, we show that SOP substantially improves the performance of large pretrained VLA models while maintaining a single shared policy across tasks. Effective post-training can be achieved within hours of real-world interaction, and performance scales near-linearly with the number of robots in the fleet. These results suggest that tightly coupling online learning with fleet-scale deployment is instrumental to enabling efficient, reliable, and scalable post-training of generalist robot policies in the physical world.
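The closed-loop architecture described above, in which a fleet of robots streams on-policy experience and human intervention signals to a centralized learner while asynchronously pulling updated policies, can be illustrated with a minimal threading sketch. This is a hypothetical illustration, not the paper's implementation: all names (`experience_stream`, `policy_store`, `robot_worker`, `learner`) and the batch sizes are assumptions for the sake of the example.

```python
import queue
import threading

# Hypothetical sketch of SOP-style closed-loop post-training:
# robot workers stream experience (with human-intervention flags) to a
# central learner; the learner publishes new policy versions that robots
# pick up asynchronously. All identifiers here are illustrative.

experience_stream = queue.Queue()   # robots -> learner
policy_store = {"version": 0}       # learner -> robots (latest policy version)
policy_lock = threading.Lock()

def robot_worker(robot_id, n_episodes):
    for ep in range(n_episodes):
        with policy_lock:
            version = policy_store["version"]  # asynchronously fetch latest policy
        # Roll out one episode with the current policy; a human supervisor
        # may intervene (e.g., HG-DAgger-style corrections).
        transition = {
            "robot": robot_id,
            "policy_version": version,
            "intervention": ep % 3 == 0,  # stand-in for a human-correction signal
        }
        experience_stream.put(transition)

def learner(total_transitions, batch_size):
    seen = 0
    while seen < total_transitions:
        batch = [experience_stream.get() for _ in range(batch_size)]
        seen += len(batch)
        # Update the policy on the streamed batch (learning details omitted),
        # then publish the new version without blocking the robots.
        with policy_lock:
            policy_store["version"] += 1

# Three parallel robots, four episodes each; the learner consumes all
# 12 transitions in batches of 4, producing 3 policy updates.
robots = [threading.Thread(target=robot_worker, args=(i, 4)) for i in range(3)]
trainer = threading.Thread(target=learner, args=(12, 4))
for t in robots + [trainer]:
    t.start()
for t in robots + [trainer]:
    t.join()

print("final policy version:", policy_store["version"])  # prints 3
```

The sketch captures the two properties the abstract emphasizes: experience collection scales by adding robot threads, and policy updates flow back asynchronously, so execution never blocks on learning.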