Vision-Language Models as a Source of Rewards
December 14, 2023
Authors: Kate Baumli, Satinder Baveja, Feryal Behbahani, Harris Chan, Gheorghe Comanici, Sebastian Flennerhag, Maxime Gazeau, Kristian Holsheimer, Dan Horgan, Michael Laskin, Clare Lyle, Hussain Masoom, Kay McKinney, Volodymyr Mnih, Alexander Neitz, Fabio Pardo, Jack Parker-Holder, John Quan, Tim Rocktäschel, Himanshu Sahni, Tom Schaul, Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, Lei Zhang
cs.AI
Abstract
Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models, and used to train RL agents that can achieve a variety of language goals. We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.
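As a rough illustration of the idea, the sketch below shows how a binary reward for a language goal could be derived from an off-the-shelf CLIP model by thresholding the cosine similarity between the embedding of the current observation frame and the embedding of the goal text. It is a minimal sketch only: it assumes the Hugging Face transformers CLIP implementation, and the checkpoint name, threshold value, and helper function are illustrative rather than taken from the paper.

```python
# Minimal sketch: a CLIP-derived binary reward for a language goal.
# Assumes the Hugging Face `transformers` CLIP implementation; the checkpoint
# and threshold below are illustrative, not the ones used in the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_goal_reward(frame: Image.Image, goal: str, threshold: float = 0.3) -> float:
    """Return 1.0 if the frame's CLIP embedding is close enough to the goal text, else 0.0."""
    inputs = processor(text=[goal], images=frame, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    # Cosine similarity between L2-normalized image and text embeddings.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = (image_emb @ text_emb.T).item()
    return 1.0 if similarity >= threshold else 0.0
```

In a training loop, such a function could replace a hand-written reward: each environment frame is scored against the current language goal, and the thresholded similarity is passed to the RL agent as its reward signal.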