Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
May 1, 2026
Authors: Yi Wang, Xinchen Li, Pengwei Xie, Pu Yang, Buqing Nie, Yunuo Cai, Qinglin Zhang, Chendi Qu, Jeffrey Wu, Jianheng Song, Xinlin Ren, Jingshun Huang, Mingjie Pan, Siyuan Feng, Zhi Chen, Jianlan Luo
cs.AI
Abstract
Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3–5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.
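The abstract does not define DIVL; its name suggests an expectile-based implicit value objective in the style of implicit Q-learning, where the value function is regressed toward an upper expectile of sampled targets so that it avoids querying out-of-sample actions. The sketch below illustrates only that generic expectile-regression ingredient (the `tau` parameter, toy targets, and function names are illustrative assumptions, not the paper's actual DIVL formulation):

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric (expectile) regression loss.

    `diff` is target minus estimate. Positive errors (target above the
    current value estimate) are weighted by `tau`, negative errors by
    `1 - tau`; with tau > 0.5 the fitted value drifts toward an upper
    expectile of the target distribution.
    """
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return weight * diff ** 2

# Toy example: sparse-reward return targets for one state and a value guess.
q_targets = np.array([0.0, 0.0, 1.0, 1.0])
v_estimate = 0.4
loss = expectile_loss(q_targets - v_estimate).mean()
```

With `tau = 0.7`, under-estimation errors cost more than over-estimation errors, which is what pushes the value estimate above the mean of the sparse-reward targets.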