Benchmarking Neural Network Training Algorithms
June 12, 2023
Authors: George E. Dahl, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh Medapati, Runa Eschenhagen, Priya Kasimbeg, Daniel Suo, Juhan Bae, Justin Gilmer, Abel L. Peirson, Bilal Khan, Rohan Anil, Mike Rabbat, Shankar Krishnan, Daniel Snider, Ehsan Amid, Kongtao Chen, Chris J. Maddison, Rakshith Vasudev, Michal Badura, Ankush Garg, Peter Mattson
cs.AI
Abstract
Training algorithms, broadly construed, are an essential part of every deep
learning pipeline. Training algorithm improvements that speed up training
across a wide variety of workloads (e.g., better update rules, tuning
protocols, learning rate schedules, or data selection schemes) could save time,
save computational resources, and lead to better, more accurate models.
Unfortunately, as a community, we are currently unable to reliably identify
training algorithm improvements, or even determine the state-of-the-art
training algorithm. In this work, using concrete experiments, we argue that
real progress in speeding up training requires new benchmarks that resolve
three basic challenges faced by empirical comparisons of training algorithms:
(1) how to decide when training is complete and precisely measure training
time, (2) how to handle the sensitivity of measurements to exact workload
details, and (3) how to fairly compare algorithms that require hyperparameter
tuning. In order to address these challenges, we introduce a new, competitive,
time-to-result benchmark using multiple workloads running on fixed hardware,
the AlgoPerf: Training Algorithms benchmark. Our benchmark includes a set of
workload variants that make it possible to detect benchmark submissions that
are more robust to workload changes than current widely used methods. Finally,
we evaluate baseline submissions constructed using various optimizers that
represent current practice, as well as other optimizers that have recently
received attention in the literature. These baseline results collectively
demonstrate the feasibility of our benchmark, show that non-trivial gaps
between methods exist, and set a provisional state of the art for future
benchmark submissions to surpass.
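
To make the time-to-result idea concrete, here is a minimal sketch in Python (not taken from the AlgoPerf codebase) of how training time could be measured against a fixed validation target on fixed hardware. The names init_model, train_step, evaluate, the target value, and the runtime budget are placeholders chosen for illustration, not part of the benchmark's actual API or rules.

import time

# Illustrative only: measure wall-clock time until a fixed validation target is
# first reached, cutting the run off at a maximum runtime budget. All constants
# and callables here are hypothetical placeholders.
VALIDATION_TARGET = 0.75      # e.g., a per-workload accuracy threshold
EVAL_EVERY_N_STEPS = 100
MAX_RUNTIME_SECONDS = 3600    # runs that never reach the target are cut off

def time_to_result(init_model, train_step, evaluate, data_iterator):
    """Return seconds until the validation target is first reached, else None."""
    model = init_model()
    start = time.perf_counter()
    step = 0
    while time.perf_counter() - start < MAX_RUNTIME_SECONDS:
        model = train_step(model, next(data_iterator))
        step += 1
        if step % EVAL_EVERY_N_STEPS == 0:
            # A real harness must also decide whether evaluation time counts
            # toward the budget; here it does, purely to keep the sketch short.
            if evaluate(model) >= VALIDATION_TARGET:
                return time.perf_counter() - start
    return None  # target not reached within the runtime budget

Under this kind of rule, "when training is complete" is defined by reaching a target metric rather than by a fixed step count, which is what makes head-to-head timing comparisons between training algorithms well defined.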