Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Abstract

Generalist robot policies increasingly benefit from large-scale pretraining, but offline data alone is insufficient for robust real-world deployment. Deployed robots encounter distribution shifts, long-tail failures, task variations, and opportunities for human correction that fixed demonstration datasets cannot fully capture. We present Learning While Deploying (LWD), a fleet-scale offline-to-online reinforcement learning framework for continual post-training of generalist Vision-Language-Action (VLA) policies. Starting from a pretrained VLA policy, LWD closes the loop between deployment, shared physical experience, policy improvement, and redeployment by using autonomous rollouts and human interventions collected across a robot fleet. To stabilize learning from heterogeneous, sparse-reward fleet data, LWD combines Distributional Implicit Value Learning (DIVL) for robust value estimation with Q-learning via Adjoint Matching (QAM) for policy extraction in flow-based VLA action generators. We validate LWD on a fleet of 16 dual-arm robots across eight real-world manipulation tasks, including semantic grocery restocking and long-horizon tasks lasting 3 to 5 minutes. A single generalist policy improves as fleet experience accumulates, reaching an average success rate of 95%, with the largest gains on long-horizon tasks.
