Offline Actor-Critic Reinforcement Learning Scales to Large Models

Abstract

We show that offline actor-critic reinforcement learning can scale to large models, such as transformers, and follows scaling laws similar to those of supervised learning. We find that offline actor-critic algorithms can outperform strong supervised behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor-critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.
