Vision-Language Models as a Source of Rewards
December 14, 2023
Authors: Kate Baumli, Satinder Baveja, Feryal Behbahani, Harris Chan, Gheorghe Comanici, Sebastian Flennerhag, Maxime Gazeau, Kristian Holsheimer, Dan Horgan, Michael Laskin, Clare Lyle, Hussain Masoom, Kay McKinney, Volodymyr Mnih, Alexander Neitz, Fabio Pardo, Jack Parker-Holder, John Quan, Tim Rocktäschel, Himanshu Sahni, Tom Schaul, Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, Lei Zhang
cs.AI
Abstract
Building generalist agents that can accomplish many goals in rich open-ended
environments is one of the research frontiers for reinforcement learning. A key
limiting factor for building generalist agents with RL has been the need for a
large number of reward functions for achieving different goals. We investigate
the feasibility of using off-the-shelf vision-language models, or VLMs, as
sources of rewards for reinforcement learning agents. We show how rewards for
visual achievement of a variety of language goals can be derived from the CLIP
family of models, and used to train RL agents that can achieve a variety of
language goals. We showcase this approach in two distinct visual domains and
present a scaling trend showing how larger VLMs lead to more accurate rewards
for visual goal achievement, which in turn produces more capable RL agents.
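
The core idea can be illustrated with a short sketch: embed the agent's current observation and the language goal with a CLIP-style model, take the cosine similarity between the two embeddings, and threshold it to obtain a sparse goal-achievement reward. The snippet below is a minimal illustration of this idea using the open_clip package; the model and checkpoint names, the example goal text, and the threshold value are assumptions chosen for illustration, not the paper's exact recipe.

```python
import torch
import open_clip
from PIL import Image

# Load an off-the-shelf CLIP model (model/checkpoint names are illustrative).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


def clip_reward(frame_path: str, goal_text: str, threshold: float = 0.3) -> float:
    """Return a binary reward: 1.0 if the frame visually matches the goal text."""
    image = preprocess(Image.open(frame_path)).unsqueeze(0)
    text = tokenizer([goal_text])

    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)

    # Cosine similarity between L2-normalized image and text embeddings.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    similarity = (img_emb @ txt_emb.T).item()

    # Thresholding turns the similarity score into a sparse goal-achievement reward.
    return 1.0 if similarity > threshold else 0.0


# Example usage: score one environment frame against a language goal
# (file name and goal text are hypothetical).
# reward = clip_reward("frame.png", "the agent is standing next to a blue cube")
```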