LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Abstract

Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline that generates diverse, high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned on the resulting collection of datasets, using a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.

