Playful Agentic Robot Learning

Abstract

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise their behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only in response to explicit instructions. We study Playful Agentic Robot Learning, in which an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs (Robotics Agent Teams), a system designed for play-time skill acquisition. During play, RATs proposes exploratory tasks that are novel yet learnable, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent library of code skills. At test time, the library is frozen and the agent reuses relevant skills from it to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve performance on held-out downstream tasks over both no-play and random-play baselines, with gains of 20.6 and 17.0 percentage points over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents simply by retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.
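As a rough illustration of the play-then-reuse loop described in the abstract, the sketch below shows a skill library that is filled during play and frozen before test time. It is a minimal sketch under stated assumptions, not the RATs implementation: every name here (`Skill`, `SkillLibrary`, `play_episode`, the `write_policy` and `execute` callbacks) is hypothetical, the agent and robot are replaced by stub functions, and retrieval uses a toy word-overlap score rather than whatever retrieval the system actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Skill:
    name: str
    description: str
    code: str  # source text of a robot-code policy (hypothetical format)


@dataclass
class SkillLibrary:
    skills: Dict[str, Skill] = field(default_factory=dict)
    frozen: bool = False

    def add(self, skill: Skill) -> None:
        # Skills may only be added during play, before the library is frozen.
        if self.frozen:
            raise RuntimeError("library is frozen; skills cannot be added at test time")
        self.skills[skill.name] = skill

    def freeze(self) -> None:
        self.frozen = True

    def retrieve(self, task: str, k: int = 2) -> List[Skill]:
        # Toy relevance score (an assumption): words shared by task and description.
        task_words = set(task.lower().split())
        scored = [
            (len(task_words & set(s.description.lower().split())), s)
            for s in self.skills.values()
        ]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [s for _, s in scored[:k]]


def play_episode(
    task: str,
    write_policy: Callable[[str, Optional[str]], str],
    execute: Callable[[str], Tuple[bool, Optional[str]]],
    library: SkillLibrary,
    max_attempts: int = 3,
) -> bool:
    """Try a self-proposed task, retrying with feedback; store the code on success."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        code = write_policy(task, feedback)   # agent writes or revises the policy
        ok, feedback = execute(code)          # run it and collect feedback
        if ok:
            library.add(Skill(task.replace(" ", "_"), task, code))
            return True
    return False


# Stub agent and environment, purely for illustration.
def stub_write_policy(task: str, feedback: Optional[str]) -> str:
    return "close_gripper()" if feedback else "open_gripper()"


def stub_execute(code: str) -> Tuple[bool, Optional[str]]:
    ok = "close" in code
    return ok, None if ok else "gripper still open"


library = SkillLibrary()
play_episode("pick up the block", stub_write_policy, stub_execute, library)
library.freeze()  # test time: reuse only, no new skills
retrieved = library.retrieve("pick up the red block")
```

Freezing the library before evaluation mirrors the abstract's separation between play-time acquisition and test-time reuse; the retrieved code would then be placed into the agent's context for the new task.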