Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
May 1, 2026
Authors: Yi Wang, Xinchen Li, Pengwei Xie, Pu Yang, Buqing Nie, Yunuo Cai, Qinglin Zhang, Chendi Qu, Jeffrey Wu, Jianheng Song, Xinlin Ren, Jingshun Huang, Mingjie Pan, Siyuan Feng, Zhi Chen, Jianlan Luo
cs.AI
Abstract
Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3–5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.
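The abstract does not define DIVL; its name suggests an expectile-based implicit value objective in the style of implicit Q-learning, where the value function is regressed toward an upper expectile of sampled targets so that it avoids querying out-of-sample actions. The sketch below illustrates only that generic expectile-regression ingredient (the `tau` parameter, toy targets, and function names are illustrative assumptions, not the paper's actual DIVL formulation):

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric (expectile) regression loss.

    `diff` is target minus estimate. Positive errors (target above the
    current value estimate) are weighted by `tau`, negative errors by
    `1 - tau`; with tau > 0.5 the fitted value drifts toward an upper
    expectile of the target distribution.
    """
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return weight * diff ** 2

# Toy example: sparse-reward return targets for one state and a value guess.
q_targets = np.array([0.0, 0.0, 1.0, 1.0])
v_estimate = 0.4
loss = expectile_loss(q_targets - v_estimate).mean()
```

With `tau = 0.7`, under-estimation errors cost more than over-estimation errors, which is what pushes the value estimate above the mean of the sparse-reward targets.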