Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
October 19, 2023
Authors: Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner
cs.AI
Abstract
Reinforcement learning (RL) requires either manually specifying a reward
function, which is often infeasible, or learning a reward model from a large
amount of human feedback, which is often very expensive. We study a more
sample-efficient alternative: using pretrained vision-language models (VLMs) as
zero-shot reward models (RMs) to specify tasks via natural language. We propose
a natural and general approach to using VLMs as reward models, which we call
VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn
complex tasks without a manually specified reward function, such as kneeling,
doing the splits, and sitting in a lotus position. For each of these tasks, we
only provide a single sentence text prompt describing the desired task with
minimal prompt engineering. We provide videos of the trained agents at:
https://sites.google.com/view/vlm-rm. We can improve performance by providing a
second "baseline" prompt and projecting out the parts of the CLIP embedding space
that are irrelevant to distinguishing between the goal and the baseline. Further, we find a strong
scaling effect for VLM-RMs: larger VLMs trained with more compute and data are
better reward models. The failure modes of VLM-RMs we encountered are all
related to known capability limitations of current VLMs, such as limited
spatial reasoning ability or visually unrealistic environments that are far
off-distribution for the VLM. We find that VLM-RMs are remarkably robust as
long as the VLM is large enough. This suggests that future VLMs will become
more and more useful reward models for a wide range of RL applications.
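The core recipe is simple: render the current environment state to an image, embed the image and the task prompt with CLIP, and use their cosine similarity as the reward. Below is a minimal, illustrative sketch of this idea, assuming the open-source clip package and a ViT-L/14 checkpoint; the prompt string, checkpoint choice, and the clip_reward helper are assumptions for illustration, not the paper's exact implementation.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)  # illustrative checkpoint choice

@torch.no_grad()
def clip_reward(frame: Image.Image, prompt: str) -> float:
    # Reward = cosine similarity between the rendered frame and the task prompt.
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize([prompt]).to(device)
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

# Example: score one environment step (assumes env.render() returns an RGB array).
# frame = Image.fromarray(env.render())
# r = clip_reward(frame, "a humanoid robot kneeling")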
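The "baseline" prompt trick can be sketched in the same way: embed a goal prompt and a neutral baseline prompt, and partially project the image embedding onto the direction separating them before measuring similarity to the goal. The projection and the alpha mixing weight below are one plausible reading of that idea, reusing model, preprocess, and device from the sketch above; the prompts and the exact formula are illustrative, not the paper's definition.

@torch.no_grad()
def regularized_reward(frame: Image.Image, goal_prompt: str,
                       baseline_prompt: str, alpha: float = 0.5) -> float:
    # alpha = 0 recovers plain CLIP similarity; alpha = 1 keeps only the component
    # of the image embedding along the goal-minus-baseline direction (an assumption).
    image = preprocess(frame).unsqueeze(0).to(device)
    tokens = clip.tokenize([goal_prompt, baseline_prompt]).to(device)
    x = model.encode_image(image).squeeze(0)
    g, b = model.encode_text(tokens)
    x, g, b = (v / v.norm() for v in (x, g, b))
    d = (g - b) / (g - b).norm()
    proj = b + ((x - b) @ d) * d          # project x onto the line through b and g
    x_reg = alpha * proj + (1 - alpha) * x
    return ((x_reg @ g) / (x_reg.norm() * g.norm())).item()

# Example (hypothetical prompts):
# r = regularized_reward(frame, "a humanoid robot kneeling", "a humanoid robot")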