Vision-Language Models as a Source of Rewards

Abstract

Building generalist agents that can accomplish many goals in rich, open-ended environments is one of the research frontiers for reinforcement learning (RL). A key limiting factor in building generalist agents with RL has been the need for a large number of reward functions, one for each goal to be achieved. We investigate the feasibility of using off-the-shelf vision-language models (VLMs) as sources of rewards for RL agents. We show how rewards for the visual achievement of a variety of language goals can be derived from the CLIP family of models and used to train RL agents that achieve those goals. We showcase this approach in two distinct visual domains and present a scaling trend: larger VLMs yield more accurate rewards for visual goal achievement, which in turn produce more capable RL agents.
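The abstract does not spell out how a reward is computed from a CLIP-style model, so the following is only a minimal sketch of one plausible construction, not the authors' method. It assumes the image observation and the language goal have already been embedded by a contrastive VLM, scores the goal against a set of alternative ("negative") goals with a softmax over cosine similarities, and thresholds the result into a binary reward. The function names, the use of negative goals, and the `temperature` and `threshold` values are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def goal_reward(
    image_emb: Sequence[float],
    goal_emb: Sequence[float],
    negative_embs: List[Sequence[float]],
    temperature: float = 0.1,   # hypothetical value
    threshold: float = 0.5,     # hypothetical value
) -> Tuple[float, float]:
    """Return (binary_reward, goal_probability) for one observation.

    The embeddings stand in for the outputs of a CLIP-style image encoder
    and text encoder; producing them is outside the scope of this sketch.
    """
    # Similarity of the image to the goal and to each alternative goal.
    sims = [cosine(image_emb, goal_emb)]
    sims += [cosine(image_emb, neg) for neg in negative_embs]

    # Numerically stable softmax over temperature-scaled similarities.
    logits = [s / temperature for s in sims]
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    p_goal = exps[0] / sum(exps)

    # Sparse binary reward: 1 only when the goal is the clear best match.
    reward = 1.0 if p_goal > threshold else 0.0
    return reward, p_goal
```

In an RL loop, a function like this would be called on each observation together with the embedding of the current language goal, and the binary output would be passed to the agent as its reward signal. Thresholding trades some reward coverage for fewer false positives; whether that trade is appropriate depends on the environment.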
