

Stop Regressing: Training Value Functions via Classification for Scalable Deep RL

March 6, 2024
作者: Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, Rishabh Agarwal
cs.AI

Abstract

Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, in this paper, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that value functions trained with categorical cross-entropy significantly improve performance and scalability in a variety of domains. These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.
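
To make the core idea concrete, the sketch below shows one way a scalar bootstrapped TD target can be projected onto a fixed set of value bins with a "two-hot" encoding and trained with categorical cross-entropy instead of MSE. This is a minimal illustration, not the paper's reference implementation: the bin range, tensor shapes, and function names (`two_hot`, `categorical_value_loss`) are illustrative assumptions, and the paper's preferred variant (HL-Gauss) instead smooths the target with a Gaussian over the bins.

```python
# Hypothetical sketch of categorical value learning with a two-hot target projection.
# Bin range, shapes, and names are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F

NUM_BINS = 51
V_MIN, V_MAX = -10.0, 10.0
bin_centers = torch.linspace(V_MIN, V_MAX, NUM_BINS)  # fixed support for the value distribution


def two_hot(target: torch.Tensor) -> torch.Tensor:
    """Project scalar targets of shape (batch,) onto the bin support as a two-hot distribution."""
    target = target.clamp(V_MIN, V_MAX)
    # Fractional index of each target within the support.
    idx = (target - V_MIN) / (V_MAX - V_MIN) * (NUM_BINS - 1)
    lower = idx.floor().long().clamp(0, NUM_BINS - 1)
    upper = (lower + 1).clamp(0, NUM_BINS - 1)
    upper_w = idx - lower.float()
    probs = torch.zeros(target.shape[0], NUM_BINS, device=target.device)
    probs.scatter_(1, lower.unsqueeze(1), (1.0 - upper_w).unsqueeze(1))
    probs.scatter_add_(1, upper.unsqueeze(1), upper_w.unsqueeze(1))
    return probs  # each row sums to 1


def categorical_value_loss(logits: torch.Tensor, td_target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted bin logits and the projected TD target.

    This replaces the usual MSE regression loss; the scalar value estimate, when
    needed, is the expectation sum(softmax(logits) * bin_centers).
    """
    target_probs = two_hot(td_target.detach())  # no gradient through the bootstrapped target
    return -(target_probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()


# Usage sketch: logits come from any value head with NUM_BINS outputs.
logits = torch.randn(32, NUM_BINS, requires_grad=True)
td_target = torch.randn(32) * 3.0  # stand-in for bootstrapped TD targets
loss = categorical_value_loss(logits, td_target)
loss.backward()
value_estimate = (F.softmax(logits, dim=-1) * bin_centers).sum(dim=-1)  # scalar values
```

Under this setup the agent still reads off scalar values as an expectation over `bin_centers`, but gradients flow through a cross-entropy loss on a bounded categorical output, which is the property the paper credits with better robustness to noisy targets and non-stationarity than MSE regression.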