Benchmarking Neural Network Training Algorithms

Abstract

Training algorithms, broadly construed, are an essential part of every deep learning pipeline. Training algorithm improvements that speed up training across a wide variety of workloads (e.g., better update rules, tuning protocols, learning rate schedules, or data selection schemes) could save time, save computational resources, and lead to better, more accurate models. Unfortunately, as a community, we are currently unable to reliably identify training algorithm improvements, or even determine the state-of-the-art training algorithm. In this work, using concrete experiments, we argue that real progress in speeding up training requires new benchmarks that resolve three basic challenges faced by empirical comparisons of training algorithms: (1) how to decide when training is complete and precisely measure training time, (2) how to handle the sensitivity of measurements to exact workload details, and (3) how to fairly compare algorithms that require hyperparameter tuning. To address these challenges, we introduce a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware, the AlgoPerf: Training Algorithms benchmark. Our benchmark includes a set of workload variants that make it possible to detect benchmark submissions that are more robust to workload changes than current widely used methods. Finally, we evaluate baseline submissions constructed using various optimizers that represent current practice, as well as other optimizers that have recently received attention in the literature. These baseline results collectively demonstrate the feasibility of our benchmark, show that non-trivial gaps between methods exist, and set a provisional state of the art for future benchmark submissions to try to surpass.
