Old Optimizer, New Norm: An Anthology

Abstract

Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of $\mathbb{R}^{m \times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
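To make the claim about switching off exponential moving averages concrete, here is a minimal sketch for the Adam case only. It assumes that "switching off" means setting both decay rates to zero (beta1 = beta2 = 0), that epsilon is zero, and that no gradient entry is exactly zero. Under those assumptions the Adam update reduces to sign descent, which is the steepest descent direction under the max (infinity) norm. The function names `adam_step_no_ema` and `sign_descent_step` are hypothetical and chosen for illustration; they are not taken from the paper or from any library.

```python
import math


def adam_step_no_ema(grad, lr=0.1, eps=0.0):
    """Adam update with beta1 = beta2 = 0 (assumed meaning of 'EMA switched off').

    With both decay rates at zero, the first moment is the current gradient,
    the second moment is its elementwise square, and bias correction is 1.
    Assumes no gradient entry is exactly zero when eps = 0.
    """
    update = []
    for g in grad:
        m = g        # first-moment estimate with beta1 = 0
        v = g * g    # second-moment estimate with beta2 = 0
        update.append(-lr * m / (math.sqrt(v) + eps))
    return update


def sign_descent_step(grad, lr=0.1):
    """Sign descent: move each coordinate by lr against the gradient's sign."""
    return [-lr * math.copysign(1.0, g) for g in grad]


# Usage: the two updates agree coordinate by coordinate.
grad = [0.5, -2.0, 3.0]
adam_update = adam_step_no_ema(grad)
sign_update = sign_descent_step(grad)
assert all(abs(a - s) < 1e-12 for a, s in zip(adam_update, sign_update))
```

The sketch covers a flat list of scalars. The Shampoo and Prodigy cases, and the assignment of different operator norms to different tensors, are not illustrated here.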
