Title: Moto: Latent Motion Token as the Bridging Language for Robot Manipulation

Abstract

Large Language Models pre-trained on extensive corpora have recently shown significant success across natural language processing tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given that abundant video data containing interaction-related knowledge is available as a rich "corpus", can a similar generative pre-training approach be effectively applied to enhance robot learning? The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills by observing dynamic environments, we propose that effective robot learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce Moto, which converts video content into sequences of latent Motion Tokens using a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates a promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output likelihood. To transfer learned motion priors to real robot actions, we implement a co-fine-tuning strategy that seamlessly bridges latent motion token prediction and real robot control. Extensive experiments show that the fine-tuned Moto-GPT exhibits superior robustness and efficiency on robot manipulation benchmarks, underscoring its effectiveness in transferring knowledge from video data to downstream visual manipulation tasks.
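As a rough illustration of what assessing a trajectory through output likelihood can look like, the sketch below scores a sequence of discrete motion tokens by its average log-probability under per-step next-token distributions. The function name `trajectory_log_likelihood`, the four-token vocabulary, and the hand-written probabilities are assumptions made for illustration only. They are not taken from Moto-GPT, whose actual scoring procedure is not described here.

```python
import math
from typing import Sequence


def trajectory_log_likelihood(
    step_probs: Sequence[Sequence[float]],
    tokens: Sequence[int],
) -> float:
    """Average log-probability of `tokens` under per-step distributions.

    step_probs[t][k] is the (hypothetical) model probability that the
    token at step t equals k, given the preceding tokens.
    """
    if len(step_probs) != len(tokens):
        raise ValueError("need one distribution per token")
    if not tokens:
        raise ValueError("empty trajectory")
    total = 0.0
    for probs, tok in zip(step_probs, tokens):
        total += math.log(probs[tok])
    # Normalize by length so trajectories of different lengths are comparable.
    return total / len(tokens)


# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
# Simplification: both trajectories below are scored against the same fixed
# distributions; a real autoregressive model would recompute them from each
# trajectory's own prefix.
STEP_PROBS = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

plausible = trajectory_log_likelihood(STEP_PROBS, [0, 1, 3])
implausible = trajectory_log_likelihood(STEP_PROBS, [2, 0, 1])
print(plausible > implausible)
```

Under this toy setup, the trajectory that follows the high-probability tokens receives the higher score, which is the sense in which a likelihood can serve as a plausibility signal.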
