Towards A Unified Agent with Foundation Models

Abstract

Language Models and Vision Language Models have recently demonstrated unprecedented capabilities, expressed in text form, in understanding human intentions, reasoning, scene understanding, and planning-like behaviour, among others. In this work, we investigate how to embed and leverage such abilities in Reinforcement Learning (RL) agents. We design a framework that uses language as the core reasoning tool, and explore how this enables an agent to tackle a series of fundamental RL challenges, such as efficient exploration, reusing experience data, scheduling skills, and learning from observations, which traditionally require separate, vertically designed algorithms. We test our method on a sparse-reward simulated robotic manipulation environment, where a robot needs to stack a set of objects. We demonstrate substantial performance improvements over baselines in exploration efficiency and in the ability to reuse data from offline datasets, and illustrate how to reuse learned skills to solve novel tasks or imitate videos of human experts.

