Offline Actor-Critic Reinforcement Learning Scales to Large Models

Abstract

We show that offline actor-critic reinforcement learning can scale to large models, such as transformers, and follows scaling laws similar to those of supervised learning. We find that offline actor-critic algorithms can outperform strong supervised behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor-critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.
