How to Scale Your EMA
July 25, 2023
Authors: Dan Busbridge, Jason Ramapuram, Pierre Ablin, Tatiana Likhomanenko, Eeshan Gunesh Dhekane, Xavier Suau, Russ Webb
cs.AI
Abstract
Preserving training dynamics across batch sizes is an important tool for
practical machine learning as it enables the trade-off between batch size and
wall-clock time. This trade-off is typically enabled by a scaling rule, for
example, in stochastic gradient descent, one should scale the learning rate
linearly with the batch size. Another important tool for practical machine
learning is the model Exponential Moving Average (EMA), which is a model copy
that does not receive gradient information, but instead follows its target
model with some momentum. This model EMA can improve the robustness and
generalization properties of supervised learning, stabilize pseudo-labeling,
and provide a learning signal for Self-Supervised Learning (SSL). Prior works
have treated the model EMA separately from optimization, leading to different
training dynamics across batch sizes and lower model performance. In this work,
we provide a scaling rule for optimization in the presence of model EMAs and
demonstrate its validity across a range of architectures, optimizers, and data
modalities. We also show the rule's validity where the model EMA contributes to
the optimization of the target model, enabling us to train EMA-based
pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we
enable training of BYOL up to batch size 24,576 without sacrificing
performance, optimally a 6× wall-clock time reduction.
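
The abstract refers to two batch-size scaling ingredients: the linear learning-rate rule for SGD and a model EMA that tracks the target model with some momentum. The sketch below is a minimal illustration of how both could appear in a training loop; it is not the paper's implementation. The helper names (scale_hyperparameters, ema_update), the toy model, and the specific momentum adjustment rho -> rho**kappa when the batch size grows by a factor kappa are assumptions made here for illustration only.

# Minimal sketch (assumptions: plain SGD, a toy linear model; base_lr,
# base_rho, and kappa are illustrative names, not taken from the paper).
import copy
import torch
import torch.nn as nn

def scale_hyperparameters(base_lr, base_rho, kappa):
    """Adjust hyperparameters when the batch size is multiplied by kappa.

    - Learning rate: linear scaling for SGD (mentioned in the abstract).
    - EMA momentum: rho -> rho**kappa, one plausible way to keep the EMA's
      averaging horizon comparable per unit of data (assumed form).
    """
    return base_lr * kappa, base_rho ** kappa

@torch.no_grad()
def ema_update(ema_model, model, rho):
    """EMA model follows the target model with momentum rho:
    theta_ema <- rho * theta_ema + (1 - rho) * theta."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(rho).add_(p, alpha=1.0 - rho)

# Toy usage: scale from batch size 256 to 256 * kappa.
model = nn.Linear(32, 10)
ema_model = copy.deepcopy(model)

kappa = 8
lr, rho = scale_hyperparameters(base_lr=0.1, base_rho=0.999, kappa=kappa)
opt = torch.optim.SGD(model.parameters(), lr=lr)

x = torch.randn(256 * kappa, 32)
y = torch.randint(0, 10, (256 * kappa,))
loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
ema_update(ema_model, model, rho)  # one EMA step per optimizer step

In this sketch the EMA update runs once per optimizer step, so when a larger batch size means fewer, larger steps, adjusting rho alongside the learning rate is what is intended to keep the EMA's effective averaging behavior comparable across batch sizes.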