Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

May 1, 2026
Authors: Yi Wang, Xinchen Li, Pengwei Xie, Pu Yang, Buqing Nie, Yunuo Cai, Qinglin Zhang, Chendi Qu, Jeffrey Wu, Jianheng Song, Xinlin Ren, Jingshun Huang, Mingjie Pan, Siyuan Feng, Zhi Chen, Jianlan Luo
cs.AI

Abstract

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3-5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.