Offline Actor-Critic Reinforcement Learning Scales to Large Models
February 8, 2024
Authors: Jost Tobias Springenberg, Abbas Abdolmaleki, Jingwei Zhang, Oliver Groth, Michael Bloesch, Thomas Lampe, Philemon Brakel, Sarah Bechtle, Steven Kapturowski, Roland Hafner, Nicolas Heess, Martin Riedmiller
cs.AI
Abstract
We show that offline actor-critic reinforcement learning can scale to large models - such as transformers - and follows scaling laws similar to those of supervised learning. We find that offline actor-critic algorithms can outperform strong, supervised, behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor-critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.
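
To make the contrast with behavioral cloning concrete, the sketch below shows a generic offline actor-critic update in PyTorch. It is an illustrative assumption, not the paper's Perceiver-based model or its exact objective: the critic is trained with a one-step TD target from the fixed dataset, and the actor with an advantage-weighted likelihood of dataset actions, which reduces to plain behavioral cloning as the weighting temperature BETA goes to zero. All class names, network sizes, and hyperparameters here are hypothetical.

```python
# Minimal sketch of a generic offline actor-critic update (an assumption for
# illustration; NOT the paper's Perceiver-based architecture or objective).
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, GAMMA, BETA = 16, 4, 0.99, 1.0  # illustrative values

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, ACT_DIM))
        self.log_std = nn.Parameter(torch.zeros(ACT_DIM))

    def dist(self, obs):
        # Diagonal Gaussian policy over continuous actions.
        return torch.distributions.Normal(self.net(obs), self.log_std.exp())

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def offline_update(actor, critic, target_critic, batch, actor_opt, critic_opt):
    obs, act, rew, next_obs, done = batch

    # Critic: one-step TD target, with next actions sampled from the current
    # policy at the next state (a standard offline actor-critic choice).
    with torch.no_grad():
        next_act = actor.dist(next_obs).sample()
        target_q = rew + GAMMA * (1.0 - done) * target_critic(next_obs, next_act)
    critic_loss = ((critic(obs, act) - target_q) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: advantage-weighted log-likelihood of the dataset actions.
    # As BETA -> 0 the weights become uniform and this reduces to behavioral
    # cloning, i.e. one way to move away from pure BC gradually.
    with torch.no_grad():
        pi_act = actor.dist(obs).sample()
        adv = critic(obs, act) - critic(obs, pi_act)
        weights = torch.clamp(torch.exp(BETA * adv), max=20.0)
    log_prob = actor.dist(obs).log_prob(act).sum(-1)
    actor_loss = -(weights * log_prob).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()

if __name__ == "__main__":
    actor, critic, target_critic = Actor(), Critic(), Critic()
    target_critic.load_state_dict(critic.state_dict())
    actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
    # Fake "offline dataset" batch, only to show the call signature.
    batch = (torch.randn(256, OBS_DIM), torch.randn(256, ACT_DIM),
             torch.randn(256), torch.randn(256, OBS_DIM), torch.zeros(256))
    print(offline_update(actor, critic, target_critic, batch, actor_opt, critic_opt))
```

The key property this sketch is meant to illustrate is that the same dataset of sub-optimal and expert trajectories feeds both losses: the critic ranks dataset actions by estimated return, so the policy can exceed behavioral cloning without collecting new data.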