Stop Regressing: Training Value Functions via Classification for Scalable Deep RL

Abstract

Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that value functions trained with categorical cross-entropy significantly improve performance and scalability in a variety of domains. These include single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, with state-of-the-art results in these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.
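To make the contrast between the two objectives concrete, the following is a minimal sketch of a regression loss next to a classification loss on the same scalar target. The abstract does not say how a scalar target is turned into a categorical distribution, so the sketch assumes a two-hot encoding over fixed, evenly spaced bins; this is one common choice and may differ from what the paper uses. All function names, the bin range, and the example numbers are hypothetical.

```python
import numpy as np


def two_hot(target, bins):
    """Spread a scalar target over its two neighbouring bin centres.

    Assumption: `bins` is a sorted 1-D array of bin centres; targets
    outside the range are clipped to the nearest edge.
    """
    target = float(np.clip(target, bins[0], bins[-1]))
    upper = int(np.searchsorted(bins, target, side="left"))
    upper = min(max(upper, 1), len(bins) - 1)
    lower = upper - 1
    w_upper = (target - bins[lower]) / (bins[upper] - bins[lower])
    probs = np.zeros(len(bins))
    probs[lower] = 1.0 - w_upper
    probs[upper] = w_upper
    return probs


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def mse_value_loss(prediction, target):
    """Regression objective: squared error on a scalar value estimate."""
    return (prediction - target) ** 2


def cross_entropy_value_loss(logits, target, bins):
    """Classification objective: cross-entropy against the two-hot target."""
    return -float(np.sum(two_hot(target, bins) * log_softmax(logits)))


def expected_value(logits, bins):
    """Recover a scalar value estimate as the mean of the predicted distribution."""
    return float(np.sum(np.exp(log_softmax(logits)) * bins))


if __name__ == "__main__":
    bins = np.linspace(-1.0, 1.0, 5)   # hypothetical value range and bin count
    logits = np.zeros(5)               # an untrained head: uniform distribution
    target = 0.25                      # a hypothetical bootstrapped target
    print("two-hot target:", two_hot(target, bins))
    print("cross-entropy loss:", cross_entropy_value_loss(logits, target, bins))
    print("MSE loss:", mse_value_loss(expected_value(logits, bins), target))
```

In this sketch the network head outputs one logit per bin instead of a single scalar, and the scalar value is recovered as the expectation under the predicted distribution, so the rest of a value-based RL algorithm can remain unchanged.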
