Old Optimizer, New Norm: An Anthology

Abstract

Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of $\mathbb{R}^{m \times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
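The claim about switching off exponential moving averages can be made concrete for Adam. The sketch below is illustrative only and is not taken from the paper: the function names `adam_step_no_ema`, `sign_descent_step` and `linear_decrease` are hypothetical, and it assumes both Adam decay rates and the epsilon term are set to zero and that no gradient entry is exactly zero. Under those assumptions the Adam update reduces to a sign update, which is the steepest descent direction under the infinity norm.

```python
import math
import random


def adam_step_no_ema(grad, lr=0.1, eps=0.0):
    """Adam update with beta1 = beta2 = 0, i.e. no moving averages (assumed setting)."""
    # With the averages off, the first moment is g and the second moment is g**2,
    # so each coordinate of the update is g / (sqrt(g**2) + eps).
    return [-lr * g / (math.sqrt(g * g) + eps) for g in grad]


def sign_descent_step(grad, lr=0.1):
    """Sign update: steepest descent direction under the infinity norm, scaled by lr."""
    return [-lr * math.copysign(1.0, g) for g in grad]


def linear_decrease(grad, update):
    """First-order decrease in the loss predicted for a given update."""
    return -sum(g * u for g, u in zip(grad, update))


# A toy gradient with no zero entries (zero entries would divide by zero when eps = 0).
grad = [0.5, -2.0, 0.01, -0.3]
lr = 0.1

adam_update = adam_step_no_ema(grad, lr=lr)
sign_update = sign_descent_step(grad, lr=lr)

# Compare against random updates whose infinity norm is at most lr.
random.seed(0)
best_random = max(
    linear_decrease(grad, [random.uniform(-lr, lr) for _ in grad])
    for _ in range(1000)
)
sign_decrease = linear_decrease(grad, sign_update)
```

Here `adam_update` and `sign_update` agree coordinate by coordinate, and no random update inside the same infinity-norm ball achieves a larger first-order decrease than the sign update, whose decrease equals `lr` times the sum of absolute gradient entries.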
