Stop Regressing: Training Value Functions via Classification for Scalable Deep RL

Abstract

Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Motivated by this discrepancy, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that training value functions with categorical cross-entropy significantly improves performance and scalability in a variety of domains. These include single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.
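To make the contrast between the two objectives concrete, here is a minimal sketch of a regression loss next to a categorical cross-entropy loss on a scalar value target. The abstract does not say how a scalar target is turned into a categorical distribution, so the "two-hot" encoding over evenly spaced bins used here is an assumption, chosen as one simple option. All function names, the bin range, and the bin count are hypothetical and are not taken from the source.

```python
import math

def bin_centers(v_min, v_max, num_bins):
    """Evenly spaced bin centers covering [v_min, v_max] (assumed discretization)."""
    step = (v_max - v_min) / (num_bins - 1)
    return [v_min + i * step for i in range(num_bins)]

def two_hot(target, centers):
    """Spread a scalar target over its two neighboring bins so that the
    expectation of the resulting distribution equals the (clipped) target."""
    target = min(max(target, centers[0]), centers[-1])
    step = centers[1] - centers[0]
    pos = (target - centers[0]) / step
    lower = min(int(math.floor(pos)), len(centers) - 2)
    upper_weight = min(max(pos - lower, 0.0), 1.0)
    probs = [0.0] * len(centers)
    probs[lower] = 1.0 - upper_weight
    probs[lower + 1] = upper_weight
    return probs

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy_value_loss(logits, target, centers):
    """Classification-style loss: cross-entropy between the two-hot target
    distribution and the softmax of the network's per-bin logits."""
    log_probs = log_softmax(logits)
    target_probs = two_hot(target, centers)
    return -sum(p * lp for p, lp in zip(target_probs, log_probs))

def expected_value(logits, centers):
    """Recover a scalar value estimate as the mean of the predicted distribution."""
    return sum(math.exp(lp) * z for lp, z in zip(log_softmax(logits), centers))

def mse_value_loss(prediction, target):
    """Regression-style loss on a scalar prediction, for comparison."""
    return (prediction - target) ** 2

if __name__ == "__main__":
    centers = bin_centers(0.0, 10.0, 11)   # hypothetical value range and bin count
    logits = [0.0] * len(centers)          # stand-in for a network's output head
    target = 2.25                          # stand-in for a bootstrapped target value
    print("two-hot target:", two_hot(target, centers))
    print("cross-entropy loss:", cross_entropy_value_loss(logits, target, centers))
    print("MSE loss:", mse_value_loss(expected_value(logits, centers), target))
```

In this sketch the network head would output one logit per bin instead of a single scalar, and a scalar value is recovered as the expectation over bin centers whenever one is needed, for example when computing the next bootstrapped target.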
