
Old Optimizer, New Norm: An Anthology

September 30, 2024
Authors: Jeremy Bernstein, Laker Newhouse
cs.AI

Abstract

Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of R^{m \times n}, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
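To make the abstract's central claim concrete, the sketch below (not taken from the authors' code) shows two steepest-descent steps of the kind the paper discusses: descent under the elementwise infinity norm, which reduces to sign descent and is the form the paper associates with Adam once exponential moving averages are switched off, and descent under the spectral norm of a matrix-shaped weight, whose step is U V^T from the reduced SVD of the gradient and is the form associated with Shampoo without its accumulators. The function names, learning rates, and toy usage are illustrative assumptions, not the paper's API.

```python
# Minimal sketch, assuming PyTorch; illustrative only, not the authors' implementation.
import torch


def sign_descent_step(weight: torch.Tensor, grad: torch.Tensor, lr: float) -> None:
    """Steepest descent under the elementwise infinity (max) norm: a sign-descent step."""
    weight.add_(grad.sign(), alpha=-lr)


def spectral_descent_step(weight: torch.Tensor, grad: torch.Tensor, lr: float) -> None:
    """Steepest descent under the spectral norm of an m x n weight matrix."""
    # Reduced SVD of the gradient; the direction U @ Vh has unit spectral norm
    # and maximizes alignment with the gradient under that constraint.
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    weight.add_(U @ Vh, alpha=-lr)


# Toy usage on stand-in tensors (shapes and learning rates are arbitrary choices).
W = torch.randn(256, 128)
G = torch.randn(256, 128)              # stand-in for a gradient of W
spectral_descent_step(W, G, lr=1e-2)   # e.g. an operator norm suited to a linear layer
sign_descent_step(W, G, lr=1e-2)       # e.g. an elementwise norm for other tensors
```

Assigning one of these norm-specific steps per tensor, according to the tensor's role in the network, is one simple way to read the "new design space" the abstract describes.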
