Offline Actor-Critic Reinforcement Learning Scales to Large Models
February 8, 2024
Authors: Jost Tobias Springenberg, Abbas Abdolmaleki, Jingwei Zhang, Oliver Groth, Michael Bloesch, Thomas Lampe, Philemon Brakel, Sarah Bechtle, Steven Kapturowski, Roland Hafner, Nicolas Heess, Martin Riedmiller
cs.AI
Abstract
We show that offline actor-critic reinforcement learning can scale to large models - such as transformers - and follows scaling laws similar to those of supervised learning. We find that offline actor-critic algorithms can outperform strong, supervised, behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor-critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.
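
To make the contrast with behavioral cloning concrete, the sketch below shows a generic offline actor-critic update in PyTorch. It is an illustrative assumption, not the paper's Perceiver-based model or its exact objective: the critic is trained with a one-step TD target from the fixed dataset, and the actor with an advantage-weighted likelihood of dataset actions, which reduces to plain behavioral cloning as the weighting temperature BETA goes to zero. All class names, network sizes, and hyperparameters here are hypothetical.

```python
# Minimal sketch of a generic offline actor-critic update (an assumption for
# illustration; NOT the paper's Perceiver-based architecture or objective).
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, GAMMA, BETA = 16, 4, 0.99, 1.0  # illustrative values

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, ACT_DIM))
        self.log_std = nn.Parameter(torch.zeros(ACT_DIM))

    def dist(self, obs):
        # Diagonal Gaussian policy over continuous actions.
        return torch.distributions.Normal(self.net(obs), self.log_std.exp())

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def offline_update(actor, critic, target_critic, batch, actor_opt, critic_opt):
    obs, act, rew, next_obs, done = batch

    # Critic: one-step TD target, with next actions sampled from the current
    # policy at the next state (a standard offline actor-critic choice).
    with torch.no_grad():
        next_act = actor.dist(next_obs).sample()
        target_q = rew + GAMMA * (1.0 - done) * target_critic(next_obs, next_act)
    critic_loss = ((critic(obs, act) - target_q) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: advantage-weighted log-likelihood of the dataset actions.
    # As BETA -> 0 the weights become uniform and this reduces to behavioral
    # cloning, i.e. one way to move away from pure BC gradually.
    with torch.no_grad():
        pi_act = actor.dist(obs).sample()
        adv = critic(obs, act) - critic(obs, pi_act)
        weights = torch.clamp(torch.exp(BETA * adv), max=20.0)
    log_prob = actor.dist(obs).log_prob(act).sum(-1)
    actor_loss = -(weights * log_prob).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()

if __name__ == "__main__":
    actor, critic, target_critic = Actor(), Critic(), Critic()
    target_critic.load_state_dict(critic.state_dict())
    actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
    # Fake "offline dataset" batch, only to show the call signature.
    batch = (torch.randn(256, OBS_DIM), torch.randn(256, ACT_DIM),
             torch.randn(256), torch.randn(256, OBS_DIM), torch.zeros(256))
    print(offline_update(actor, critic, target_critic, batch, actor_opt, critic_opt))
```

The key property this sketch is meant to illustrate is that the same dataset of sub-optimal and expert trajectories feeds both losses: the critic ranks dataset actions by estimated return, so the policy can exceed behavioral cloning without collecting new data.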