Benchmarking Neural Network Training Algorithms
June 12, 2023
Authors: George E. Dahl, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh Medapati, Runa Eschenhagen, Priya Kasimbeg, Daniel Suo, Juhan Bae, Justin Gilmer, Abel L. Peirson, Bilal Khan, Rohan Anil, Mike Rabbat, Shankar Krishnan, Daniel Snider, Ehsan Amid, Kongtao Chen, Chris J. Maddison, Rakshith Vasudev, Michal Badura, Ankush Garg, Peter Mattson
cs.AI
Abstract
Training algorithms, broadly construed, are an essential part of every deep
learning pipeline. Training algorithm improvements that speed up training
across a wide variety of workloads (e.g., better update rules, tuning
protocols, learning rate schedules, or data selection schemes) could save time,
save computational resources, and lead to better, more accurate models.
Unfortunately, as a community, we are currently unable to reliably identify
training algorithm improvements, or even determine the state-of-the-art
training algorithm. In this work, using concrete experiments, we argue that
real progress in speeding up training requires new benchmarks that resolve
three basic challenges faced by empirical comparisons of training algorithms:
(1) how to decide when training is complete and precisely measure training
time, (2) how to handle the sensitivity of measurements to exact workload
details, and (3) how to fairly compare algorithms that require hyperparameter
tuning. In order to address these challenges, we introduce a new, competitive,
time-to-result benchmark using multiple workloads running on fixed hardware,
the AlgoPerf: Training Algorithms benchmark. Our benchmark includes a set of
workload variants that make it possible to detect benchmark submissions that
are more robust to workload changes than current widely used methods. Finally,
we evaluate baseline submissions constructed using various optimizers that
represent current practice, as well as other optimizers that have recently
received attention in the literature. These baseline results collectively
demonstrate the feasibility of our benchmark, show that non-trivial gaps
between methods exist, and set a provisional state of the art for future
benchmark submissions to surpass.
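
To make challenge (1) concrete, the following minimal Python sketch illustrates the time-to-result idea the abstract describes: a submission is scored by the wall-clock time it needs on fixed hardware to reach a pre-set validation target, rather than by its loss after a fixed step budget. This is not the actual AlgoPerf harness or API; the names train_step, evaluate, validation_target, and eval_interval are hypothetical placeholders.

import math
import time

def time_to_result(train_step, evaluate, validation_target,
                   max_runtime_seconds, eval_interval=100):
    """Hypothetical sketch (not the AlgoPerf API): score a training-algorithm
    submission by the wall-clock seconds it takes to reach a validation target.

    train_step(step) performs one update of the submission's algorithm,
    evaluate() returns the current validation metric (higher is better here),
    and validation_target is the workload's pre-defined goal.
    """
    start = time.monotonic()
    step = 0
    while True:
        train_step(step)                      # one update chosen by the submission
        step += 1
        if step % eval_interval == 0:         # evaluate periodically, not every step
            if evaluate() >= validation_target:
                return time.monotonic() - start   # score: seconds to reach the target
        if time.monotonic() - start > max_runtime_seconds:
            return math.inf                   # budget exhausted: submission fails this workload

A real harness would additionally have to exclude evaluation overhead from the measured time and aggregate results across the benchmark's workloads and hyperparameter-tuning budget; this fragment only illustrates the time-to-target measurement itself.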