How to Scale Your EMA
July 25, 2023
Authors: Dan Busbridge, Jason Ramapuram, Pierre Ablin, Tatiana Likhomanenko, Eeshan Gunesh Dhekane, Xavier Suau, Russ Webb
cs.AI
Abstract
Preserving training dynamics across batch sizes is an important tool for
practical machine learning as it enables the trade-off between batch size and
wall-clock time. This trade-off is typically enabled by a scaling rule, for
example, in stochastic gradient descent, one should scale the learning rate
linearly with the batch size. Another important tool for practical machine
learning is the model Exponential Moving Average (EMA), which is a model copy
that does not receive gradient information, but instead follows its target
model with some momentum. This model EMA can improve the robustness and
generalization properties of supervised learning, stabilize pseudo-labeling,
and provide a learning signal for Self-Supervised Learning (SSL). Prior works
have treated the model EMA separately from optimization, leading to different
training dynamics across batch sizes and lower model performance. In this work,
we provide a scaling rule for optimization in the presence of model EMAs and
demonstrate its validity across a range of architectures, optimizers, and data
modalities. We also show the rule's validity where the model EMA contributes to
the optimization of the target model, enabling us to train EMA-based
pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we
enable training of BYOL up to batch size 24,576 without sacrificing
performance, optimally a 6× wall-clock time reduction.
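
To make the two scaling rules mentioned in the abstract concrete, here is a minimal sketch of a model EMA update and of how the hyperparameters might be adjusted when the batch size is scaled by a factor κ. It assumes PyTorch; the helper names `update_ema` and `scale_for_batch_size`, the symbols ρ and κ, the baseline batch size of 4,096, and the starting hyperparameter values are illustrative, and the momentum exponentiation ρ → ρ^κ is our paraphrase of the paper's EMA scaling rule rather than a quotation from the abstract.

```python
# A minimal sketch, assuming PyTorch; illustrative only, not the authors' code.
# `rho` (EMA momentum) and `kappa` (batch-size scaling factor) are our notation.
import copy
import torch

def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, rho: float) -> None:
    """One EMA step: each EMA parameter follows its target with momentum rho."""
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(rho).add_(p, alpha=1.0 - rho)

def scale_for_batch_size(lr: float, rho: float, kappa: float) -> tuple[float, float]:
    """Adjust hyperparameters when the batch size is multiplied by kappa:
    the SGD learning rate scales linearly (as stated in the abstract), and the
    EMA momentum is exponentiated, rho -> rho**kappa (our paraphrase of the
    paper's EMA scaling rule)."""
    return lr * kappa, rho ** kappa

# Example: moving from an assumed baseline batch size of 4,096 to the 24,576
# quoted in the abstract, i.e. kappa = 6.
base_lr, base_rho = 0.1, 0.996  # illustrative starting values
lr, rho = scale_for_batch_size(base_lr, base_rho, kappa=24_576 / 4_096)

model = torch.nn.Linear(16, 16)
ema_model = copy.deepcopy(model)  # gradient-free copy that trails the target model
for p in ema_model.parameters():
    p.requires_grad_(False)
update_ema(ema_model, model, rho)
```

In this sketch the EMA copy never receives gradients; it only tracks the target model's parameters, which matches the abstract's description of the model EMA as a momentum-following copy used for robustness, pseudo-labeling, and SSL targets.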