Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Abstract

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and human correction opportunities that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and 3-5 minute long-horizon tasks. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.