Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
October 19, 2023
Authors: Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, David Lindner
cs.AI
Abstract
Reinforcement learning (RL) requires either manually specifying a reward
function, which is often infeasible, or learning a reward model from a large
amount of human feedback, which is often very expensive. We study a more
sample-efficient alternative: using pretrained vision-language models (VLMs) as
zero-shot reward models (RMs) to specify tasks via natural language. We propose
a natural and general approach to using VLMs as reward models, which we call
VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn
complex tasks without a manually specified reward function, such as kneeling,
doing the splits, and sitting in a lotus position. For each of these tasks, we
only provide a single sentence text prompt describing the desired task with
minimal prompt engineering. We provide videos of the trained agents at:
https://sites.google.com/view/vlm-rm. We can improve performance by providing a
second "baseline" prompt and projecting out the parts of the CLIP embedding space
that are irrelevant to distinguishing between the goal and the baseline. Further, we find a strong
scaling effect for VLM-RMs: larger VLMs trained with more compute and data are
better reward models. The failure modes of VLM-RMs we encountered are all
related to known capability limitations of current VLMs, such as limited
spatial reasoning ability or visually unrealistic environments that are far
off-distribution for the VLM. We find that VLM-RMs are remarkably robust as
long as the VLM is large enough. This suggests that future VLMs will become
more and more useful reward models for a wide range of RL applications.
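The core recipe is simple: render the current environment state to an image, embed the image and the task prompt with CLIP, and use their cosine similarity as the reward. Below is a minimal, illustrative sketch of this idea, assuming the open-source clip package and a ViT-L/14 checkpoint; the prompt string, checkpoint choice, and the clip_reward helper are assumptions for illustration, not the paper's exact implementation.

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)  # illustrative checkpoint choice

@torch.no_grad()
def clip_reward(frame: Image.Image, prompt: str) -> float:
    # Reward = cosine similarity between the rendered frame and the task prompt.
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize([prompt]).to(device)
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

# Example: score one environment step (assumes env.render() returns an RGB array).
# frame = Image.fromarray(env.render())
# r = clip_reward(frame, "a humanoid robot kneeling")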
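The "baseline" prompt trick can be sketched in the same way: embed a goal prompt and a neutral baseline prompt, and partially project the image embedding onto the direction separating them before measuring similarity to the goal. The projection and the alpha mixing weight below are one plausible reading of that idea, reusing model, preprocess, and device from the sketch above; the prompts and the exact formula are illustrative, not the paper's definition.

@torch.no_grad()
def regularized_reward(frame: Image.Image, goal_prompt: str,
                       baseline_prompt: str, alpha: float = 0.5) -> float:
    # alpha = 0 recovers plain CLIP similarity; alpha = 1 keeps only the component
    # of the image embedding along the goal-minus-baseline direction (an assumption).
    image = preprocess(frame).unsqueeze(0).to(device)
    tokens = clip.tokenize([goal_prompt, baseline_prompt]).to(device)
    x = model.encode_image(image).squeeze(0)
    g, b = model.encode_text(tokens)
    x, g, b = (v / v.norm() for v in (x, g, b))
    d = (g - b) / (g - b).norm()
    proj = b + ((x - b) @ d) * d          # project x onto the line through b and g
    x_reg = alpha * proj + (1 - alpha) * x
    return ((x_reg @ g) / (x_reg.norm() * g.norm())).item()

# Example (hypothetical prompts):
# r = regularized_reward(frame, "a humanoid robot kneeling", "a humanoid robot")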