

Offline Actor-Critic Reinforcement Learning Scales to Large Models

February 8, 2024
Authors: Jost Tobias Springenberg, Abbas Abdolmaleki, Jingwei Zhang, Oliver Groth, Michael Bloesch, Thomas Lampe, Philemon Brakel, Sarah Bechtle, Steven Kapturowski, Roland Hafner, Nicolas Heess, Martin Riedmiller
cs.AI

Abstract

We show that offline actor-critic reinforcement learning can scale to large models - such as transformers - and follows similar scaling laws as supervised learning. We find that offline actor-critic algorithms can outperform strong supervised behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor-critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.
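The abstract contrasts supervised behavioral cloning with an offline actor-critic objective trained on logged data. The sketch below is only a rough illustration of that distinction, not the paper's Perceiver-based model or its exact losses: it pairs a plain behavioral-cloning regression loss with a generic offline actor-critic update that trains a Q-critic by one-step TD and updates the actor with an advantage-weighted regression toward dataset actions (in the spirit of AWAC/CRR-style methods). The network sizes, the exponential weighting with temperature beta, and all hyperparameters are assumptions for illustration.

```python
# Illustrative sketch only: behavioral cloning vs. a generic offline actor-critic
# update on logged transitions. Not the paper's Perceiver-Actor-Critic implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 32, 8  # assumed dimensions for a continuous-control task

actor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

def bc_loss(obs, act):
    """Supervised behavioral cloning: regress the logged action directly."""
    return F.mse_loss(actor(obs), act)

def offline_ac_update(obs, act, rew, next_obs, done, gamma=0.99, beta=1.0):
    """One offline actor-critic step on a batch of logged transitions."""
    # Critic: one-step TD target using the actor's action at the next state.
    with torch.no_grad():
        next_a = actor(next_obs)
        target_q = rew + gamma * (1.0 - done) * critic(
            torch.cat([next_obs, next_a], dim=-1)).squeeze(-1)
    q = critic(torch.cat([obs, act], dim=-1)).squeeze(-1)
    critic_loss = F.mse_loss(q, target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: advantage-weighted regression toward dataset actions. The weight
    # keeps the policy close to the data while preferring higher-value actions.
    with torch.no_grad():
        v = critic(torch.cat([obs, actor(obs)], dim=-1)).squeeze(-1)
        w = torch.exp((q - v) / beta).clamp(max=20.0)
    actor_loss = (w * ((actor(obs) - act) ** 2).sum(-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()

# Example call on a random batch of logged transitions (illustrative only).
B = 64
batch = (torch.randn(B, obs_dim), torch.randn(B, act_dim), torch.randn(B),
         torch.randn(B, obs_dim), torch.zeros(B))
print(offline_ac_update(*batch))
```

The point of the contrast is that `bc_loss` imitates every logged action equally, whereas the actor-critic update re-weights actions by their estimated advantage, which is what lets it exploit sub-optimal data rather than merely copy it.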