RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning

Abstract

Developing robust and correctable visuomotor policies for robotic manipulation is challenging for two reasons: policies lack mechanisms for recovering from their own failures, and simple language instructions are limited in how well they can guide robot actions. To address these issues, we propose a scalable data generation pipeline that automatically augments expert demonstrations with failure recovery trajectories and fine-grained language annotations for training. We then introduce Rich languAge-guided failure reCovERy (RACER), a supervisor-actor framework that combines failure recovery data with rich language descriptions to enhance robot control. RACER features a vision-language model (VLM) that acts as an online supervisor, providing detailed language guidance for error correction and task execution, and a language-conditioned visuomotor policy that acts as an actor, predicting the next actions. Our experimental results show that RACER outperforms the state-of-the-art Robotic View Transformer (RVT) on RLBench across various evaluation settings, including standard long-horizon tasks, dynamic goal-change tasks, and zero-shot unseen tasks, and that it achieves superior performance in both simulated and real-world environments. Videos and code are available at: https://rich-language-failure-recovery.github.io.
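The supervisor-actor loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the VLM supervisor and the visuomotor policy are replaced by hand-written toy functions, the environment is a hypothetical 1-D gripper, and every name (`Observation`, `toy_supervisor`, `toy_actor`, `run_episode`, `slip_at`, `slip`) is an assumption introduced for illustration. The toy actor is also conditioned only on the instruction, whereas the policy described in the abstract is conditioned on visual input as well. The sketch shows only the control flow: at each step the supervisor turns the current observation into a rich language instruction, the actor turns that instruction into an action, and an injected failure is corrected by the next instruction.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Observation:
    gripper_pos: float  # toy 1-D gripper position (assumption)
    target_pos: float   # toy 1-D goal position (assumption)


def toy_supervisor(obs: Observation) -> str:
    """Hypothetical stand-in for the VLM supervisor: describe the state, then say what to do."""
    error = obs.target_pos - obs.gripper_pos
    if abs(error) <= 0.05:
        return "The gripper is aligned with the target: close the gripper."
    direction = "right" if error > 0 else "left"
    if abs(error) > 0.5:
        return f"The gripper is far from the target: move {direction} quickly."
    return f"The gripper is near the target: move {direction} slowly."


def toy_actor(instruction: str) -> Tuple[float, bool]:
    """Hypothetical stand-in for the language-conditioned policy: (displacement, close_gripper)."""
    if "close the gripper" in instruction:
        return 0.0, True
    step = 0.4 if "quickly" in instruction else 0.1
    sign = 1.0 if "right" in instruction else -1.0
    return sign * step, False


def run_episode(
    start: float,
    target: float,
    slip_at: Optional[int] = None,
    slip: float = 0.0,
    max_steps: int = 20,
) -> Tuple[bool, List[str]]:
    """Run the supervisor-actor loop; optionally inject a slip (a simulated failure)."""
    pos = start
    log: List[str] = []
    for step in range(max_steps):
        instruction = toy_supervisor(Observation(pos, target))
        log.append(instruction)
        delta, close = toy_actor(instruction)
        if close:
            return True, log  # the gripper closed on the target
        pos += delta
        if step == slip_at:
            pos += slip  # injected failure; the next instruction must correct it
    return False, log


if __name__ == "__main__":
    ok, log = run_episode(0.0, 1.0, slip_at=2, slip=-0.3)
    print("success:", ok)
    for line in log:
        print(" ", line)
```

Because the supervisor is queried at every step, recovery needs no special branch in this sketch: the slip simply produces a different observation, and the next instruction accounts for it.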
