Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Abstract

Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks, such as kneeling, doing the splits, and sitting in a lotus position, without a manually specified reward function. For each of these tasks, we provide only a single-sentence text prompt describing the desired task, with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/vlm-rm. We can improve performance by providing a second "baseline" prompt and projecting out the parts of the CLIP embedding space that are irrelevant for distinguishing between goal and baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications.
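
The abstract does not spell out how the reward is computed, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the reward for a state is the cosine similarity between an image embedding of the rendered state and a text embedding of the goal prompt, and that the "baseline" variant projects the state embedding onto the line through the baseline and goal embeddings before comparing it with the goal. The function names, the blending parameter `alpha`, and the exact form of the projection are illustrative assumptions, and random vectors stand in for real CLIP embeddings.

```python
import numpy as np


def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def clip_reward(state_emb, goal_emb):
    """Cosine similarity between a state (image) embedding and a goal (text) embedding."""
    return float(normalize(state_emb) @ normalize(goal_emb))


def regularized_reward(state_emb, goal_emb, baseline_emb, alpha=0.5):
    """Hypothetical goal-baseline variant (an assumption, not taken from the abstract).

    The state embedding is projected onto the line through the baseline and
    goal embeddings, which discards directions that do not help tell goal
    from baseline. `alpha` blends the projected and the original embedding.
    Requires goal_emb and baseline_emb to differ after normalization.
    """
    s, g, b = normalize(state_emb), normalize(goal_emb), normalize(baseline_emb)
    direction = normalize(g - b)
    # Orthogonal projection of s onto the line {b + t * direction}.
    projected = b + ((s - b) @ direction) * direction
    blended = alpha * projected + (1.0 - alpha) * s
    # For unit vectors, 1 - 0.5 * ||s - g||^2 equals the cosine similarity,
    # so alpha = 0 recovers clip_reward.
    return float(1.0 - 0.5 * np.sum((blended - g) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for CLIP embeddings of: rendered frame, goal prompt, baseline prompt.
    state, goal, baseline = rng.normal(size=(3, 512))
    print("plain reward:      ", clip_reward(state, goal))
    print("regularized reward:", regularized_reward(state, goal, baseline, alpha=0.5))
```

In an RL loop, the state embedding would come from a VLM image encoder applied to a rendered frame, and the resulting scalar would replace the hand-written environment reward. With `alpha = 0` the sketch reduces to plain cosine similarity, so the baseline prompt can be treated as an optional refinement.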

