Vision-Language Models as a Source of Rewards
December 14, 2023
Authors: Kate Baumli, Satinder Baveja, Feryal Behbahani, Harris Chan, Gheorghe Comanici, Sebastian Flennerhag, Maxime Gazeau, Kristian Holsheimer, Dan Horgan, Michael Laskin, Clare Lyle, Hussain Masoom, Kay McKinney, Volodymyr Mnih, Alexander Neitz, Fabio Pardo, Jack Parker-Holder, John Quan, Tim Rocktäschel, Himanshu Sahni, Tom Schaul, Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, Lei Zhang
cs.AI
Abstract
Building generalist agents that can accomplish many goals in rich open-ended
environments is one of the research frontiers for reinforcement learning. A key
limiting factor for building generalist agents with RL has been the need for a
large number of reward functions for achieving different goals. We investigate
the feasibility of using off-the-shelf vision-language models, or VLMs, as
sources of rewards for reinforcement learning agents. We show how rewards for
visual achievement of a variety of language goals can be derived from the CLIP
family of models, and used to train RL agents that can achieve a variety of
language goals. We showcase this approach in two distinct visual domains and
present a scaling trend showing how larger VLMs lead to more accurate rewards
for visual goal achievement, which in turn produces more capable RL agents.
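
The core idea can be illustrated with a short sketch: embed the agent's current observation and the language goal with a CLIP-style model, take the cosine similarity between the two embeddings, and threshold it to obtain a sparse goal-achievement reward. The snippet below is a minimal illustration of this idea using the open_clip package; the model and checkpoint names, the example goal text, and the threshold value are assumptions chosen for illustration, not the paper's exact recipe.

```python
import torch
import open_clip
from PIL import Image

# Load an off-the-shelf CLIP model (model/checkpoint names are illustrative).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


def clip_reward(frame_path: str, goal_text: str, threshold: float = 0.3) -> float:
    """Return a binary reward: 1.0 if the frame visually matches the goal text."""
    image = preprocess(Image.open(frame_path)).unsqueeze(0)
    text = tokenizer([goal_text])

    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)

    # Cosine similarity between L2-normalized image and text embeddings.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    similarity = (img_emb @ txt_emb.T).item()

    # Thresholding turns the similarity score into a sparse goal-achievement reward.
    return 1.0 if similarity > threshold else 0.0


# Example usage: score one environment frame against a language goal
# (file name and goal text are hypothetical).
# reward = clip_reward("frame.png", "the agent is standing next to a blue cube")
```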