Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
March 6, 2024
Authors: Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, Aviral Kumar, Rishabh Agarwal
cs.AI
Abstract
Value functions are a central component of deep reinforcement learning (RL).
These functions, parameterized by neural networks, are trained using a mean
squared error regression objective to match bootstrapped target values.
However, scaling value-based RL methods that use regression to large networks,
such as high-capacity Transformers, has proven challenging. This difficulty is
in stark contrast to supervised learning: by leveraging a cross-entropy
classification loss, supervised methods have scaled reliably to massive
networks. Observing this discrepancy, in this paper, we investigate whether the
scalability of deep RL can also be improved simply by using classification in
place of regression for training value functions. We demonstrate that value
functions trained with categorical cross-entropy significantly improve
performance and scalability in a variety of domains. These include: single-task
RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale
ResNets, robotic manipulation with Q-transformers, playing Chess without
search, and a language-agent Wordle task with high-capacity Transformers,
achieving state-of-the-art results on these domains. Through careful analysis,
we show that the benefits of categorical cross-entropy primarily stem from its
ability to mitigate issues inherent to value-based RL, such as noisy targets
and non-stationarity. Overall, we argue that a simple shift to training value
functions with categorical cross-entropy can yield substantial improvements in
the scalability of deep RL at little-to-no cost.
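
To make the core idea concrete, here is a minimal sketch, not the authors' exact implementation, of replacing the usual MSE regression loss on a bootstrapped TD target with a categorical cross-entropy loss over a fixed grid of value bins, using a simple "two-hot" projection of the scalar target (one of the transforms discussed in the paper, alongside HL-Gauss). The bin count, value range, and function names are illustrative assumptions.

```python
# Sketch only: swap MSE-on-TD-target for cross-entropy over value bins.
import jax
import jax.numpy as jnp

NUM_BINS = 51                 # number of discrete value bins (assumption)
V_MIN, V_MAX = -10.0, 10.0    # support of the value distribution (assumption)
bin_centers = jnp.linspace(V_MIN, V_MAX, NUM_BINS)


def two_hot(target: jnp.ndarray) -> jnp.ndarray:
    """Project a scalar TD target onto its two nearest bins (two-hot encoding)."""
    target = jnp.clip(target, V_MIN, V_MAX)
    # Fractional position of the target within the bin grid.
    pos = (target - V_MIN) / (V_MAX - V_MIN) * (NUM_BINS - 1)
    lower = jnp.floor(pos).astype(jnp.int32)
    upper = jnp.minimum(lower + 1, NUM_BINS - 1)
    upper_weight = pos - lower
    probs = jnp.zeros(NUM_BINS).at[lower].add(1.0 - upper_weight)
    probs = probs.at[upper].add(upper_weight)
    return probs


def categorical_td_loss(logits: jnp.ndarray, reward: float, discount: float,
                        next_value: float) -> jnp.ndarray:
    """Cross-entropy between predicted bin logits and the encoded TD target."""
    td_target = reward + discount * next_value       # bootstrapped scalar target
    target_probs = two_hot(jnp.asarray(td_target))   # scalar -> categorical target
    log_probs = jax.nn.log_softmax(logits)
    return -jnp.sum(target_probs * log_probs)        # classification loss, not MSE


def value_from_logits(logits: jnp.ndarray) -> jnp.ndarray:
    """Recover the scalar value estimate as the expectation over bin centers."""
    return jnp.sum(jax.nn.softmax(logits) * bin_centers)
```

In this framing, the network's value head outputs NUM_BINS logits instead of a single scalar, and the scalar value used for bootstrapping and action selection is recovered as the expectation over bin centers; a Gaussian smoothing of the target (HL-Gauss) or a full distributional target (as in C51) can be substituted for the two-hot projection without changing the rest of the pipeline.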