SOP: A Scalable Online Post-Training System for Vision-Language-Action Models
January 6, 2026
Authors: Mingjie Pan, Siyuan Feng, Qinglin Zhang, Xinchen Li, Jianheng Song, Chendi Qu, Yi Wang, Chuankang Li, Ziyu Xiong, Zhi Chen, Yi Liu, Jianlan Luo
cs.AI
Abstract
Vision-language-action (VLA) models achieve strong generalization through large-scale pre-training, but real-world deployment requires expert-level task proficiency in addition to broad generality. Existing post-training approaches for VLA models are typically offline, single-robot, or task-specific, limiting effective on-policy adaptation and scalable learning from real-world interaction. We introduce a Scalable Online Post-training (SOP) system that enables online, distributed, multi-task post-training of generalist VLA models directly in the physical world. SOP tightly couples execution and learning through a closed-loop architecture in which a fleet of robots continuously streams on-policy experience and human intervention signals to a centralized cloud learner, and asynchronously receives updated policies. This design supports prompt on-policy correction, scales experience collection through parallel deployment, and preserves generality during adaptation. SOP is agnostic to the choice of post-training algorithm; we instantiate it with both interactive imitation learning (HG-DAgger) and reinforcement learning (RECAP). Across a range of real-world manipulation tasks including cloth folding, box assembly, and grocery restocking, we show that SOP substantially improves the performance of large pretrained VLA models while maintaining a single shared policy across tasks. Effective post-training can be achieved within hours of real-world interaction, and performance scales near-linearly with the number of robots in the fleet. These results suggest that tightly coupling online learning with fleet-scale deployment is instrumental in enabling efficient, reliable, and scalable post-training of generalist robot policies in the physical world.
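The closed loop described above (robot workers streaming on-policy experience and intervention signals to a central learner, which asynchronously publishes updated policy versions back to the fleet) can be illustrated with a toy sketch. This is not the authors' implementation; all names (`experience_queue`, `robot_worker`, the placeholder intervention signal and "update rule") are hypothetical, and real SOP would run HG-DAgger or RECAP updates on actual robot data rather than the dummy arithmetic shown here.

```python
import queue
import threading

# Shared channel: robots stream transitions in, the learner consumes them.
experience_queue = queue.Queue()

# Central policy state, versioned so robots can pick up updates asynchronously.
policy_lock = threading.Lock()
policy = {"version": 0, "weights": 0.0}


def robot_worker(robot_id, n_steps):
    """One robot in the fleet: act with the current policy, stream experience."""
    for step in range(n_steps):
        with policy_lock:
            version = policy["version"]  # asynchronously fetched policy
        transition = {
            "robot": robot_id,
            "policy_version": version,
            # Placeholder for a human-intervention flag attached to the stream.
            "intervention": step % 5 == 0,
        }
        experience_queue.put(transition)


def learner(n_updates, batch_size):
    """Cloud learner: batch the incoming stream, update, bump the version."""
    for _ in range(n_updates):
        batch = [experience_queue.get() for _ in range(batch_size)]
        # Dummy "gradient step"; any post-training algorithm consuming the
        # same stream (interactive imitation or RL) could plug in here.
        corrections = sum(t["intervention"] for t in batch)
        with policy_lock:
            policy["weights"] += 0.01 * corrections
            policy["version"] += 1


# Three parallel robots (3 x 20 transitions) feed one learner (5 x 12 batch).
robots = [threading.Thread(target=robot_worker, args=(i, 20)) for i in range(3)]
learner_thread = threading.Thread(target=learner, args=(5, 12))
for t in robots:
    t.start()
learner_thread.start()
for t in robots:
    t.join()
learner_thread.join()
```

Because experience collection is decoupled from learning by the queue, adding robot threads scales the data stream without changing the learner, mirroring the paper's near-linear scaling argument at a toy level.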