Large Language Models as Generalizable Policies for Embodied Tasks

Abstract

We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take text instructions and visual egocentric observations as input and to output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions. We show that LLaRP is robust to complex paraphrasings of task instructions and can generalize to new tasks that require novel optimal behavior. In particular, on 1,000 unseen tasks it achieves a 42% success rate, 1.7x the success rate of other common learned baselines or zero-shot applications of LLMs. Finally, to aid the community in studying language-conditioned, massively multi-task, embodied AI problems, we release a novel benchmark, Language Rearrangement, consisting of 150,000 training and 1,000 testing tasks for language-conditioned rearrangement. Video examples of LLaRP on unseen Language Rearrangement instructions are at https://llm-rl.github.io.
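
To make the data flow described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy fixed-weight function as a stand-in for the frozen LLM, a trainable linear action head, hand-written embedding vectors, and a plain REINFORCE update in place of whatever reinforcement learning algorithm the paper actually uses (the abstract does not name one). All class and function names (`FrozenBackbone`, `ActionHead`, `run_toy_training`) are hypothetical.

```python
import math
import random


class FrozenBackbone:
    """Toy stand-in for a pre-trained LLM. Its weights are never updated."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                        for _ in range(dim)]

    def __call__(self, embeddings):
        dim = len(self.weights)
        # Mean-pool the input sequence, then apply a fixed linear map and tanh.
        pooled = [sum(vec[i] for vec in embeddings) / len(embeddings)
                  for i in range(dim)]
        return [math.tanh(sum(w * x for w, x in zip(row, pooled)))
                for row in self.weights]


class ActionHead:
    """Trainable linear layer mapping backbone features to action probabilities."""

    def __init__(self, dim, num_actions):
        self.weights = [[0.0] * dim for _ in range(num_actions)]

    def probs(self, features):
        logits = [sum(w * f for w, f in zip(row, features))
                  for row in self.weights]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - top) for value in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def reinforce_update(self, features, action, reward, lr=0.5):
        # Gradient of log pi(action) w.r.t. the logits is onehot(action) - probs.
        p = self.probs(features)
        for a, row in enumerate(self.weights):
            grad_logit = (1.0 if a == action else 0.0) - p[a]
            for i in range(len(row)):
                row[i] += lr * reward * grad_logit * features[i]


def run_toy_training(steps=200, target_action=2, seed=0):
    rng = random.Random(seed)
    backbone = FrozenBackbone(dim=4)
    head = ActionHead(dim=4, num_actions=4)
    # Made-up embeddings standing in for instruction tokens and one camera frame.
    instruction = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.8, -0.1, 0.3]]
    observation = [[0.3, -0.7, 0.9, 0.1]]
    for _ in range(steps):
        features = backbone(instruction + observation)
        p = head.probs(features)
        action = rng.choices(range(len(p)), weights=p)[0]
        # Fake environment: reward 1 only for the target action.
        reward = 1.0 if action == target_action else 0.0
        head.reinforce_update(features, action, reward)  # only the head learns
    return backbone, head, head.probs(backbone(instruction + observation))


backbone, head, final_probs = run_toy_training()
print("action probabilities after training:", final_probs)
```

In this sketch the learning signal comes only from rewards returned by interaction, and only the action head's weights change; the backbone is read but never written, which is the sense of "frozen" assumed here.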
