DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers

Abstract

Shampoo is one of the leading approximate second-order optimizers: a variant of it won the MLCommons AlgoPerf competition, and it has been shown to produce models with fewer activation outliers that are easier to compress. Yet applying Shampoo currently incurs a significant computational slowdown because of its expensive internal operations. In this paper, we take a significant step toward addressing this shortcoming by proposing DASH (Distributed Accelerated SHampoo), a faster implementation of Distributed Shampoo based on two main new techniques. First, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization. Second, we introduce the Newton-DB iteration and Chebyshev polynomial approximations as novel, faster approaches for computing the inverse matrix roots required by Shampoo. Along with these algorithmic contributions, we provide a first in-depth analysis of how matrix scaling critically affects Shampoo convergence. On the practical side, our GPU-aware implementation achieves up to 4.83× faster optimizer steps than the well-optimized Distributed Shampoo, while Newton-DB attains the lowest validation perplexity per iteration among all tested methods. Our code is available at https://github.com/IST-DASLab/DASH.
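
To make the two ideas in the abstract concrete, the sketch below stacks equally sized preconditioner blocks into a single (B, n, n) array and runs one batched iteration over the whole stack instead of looping over blocks. It is not taken from the DASH code. It uses the classical Denman-Beavers iteration for the inverse square root, on the assumption that this is what "DB" refers to; the paper's Newton-DB variant, its root order, and its scaling rule may differ. NumPy stands in for batched GPU kernels, and the function name, the `eps` damping term, and the Frobenius-norm scaling are all illustrative choices.

```python
import numpy as np

def batched_inv_sqrt_db(blocks, num_iters=20, eps=1e-6):
    """Approximate A^{-1/2} for every SPD block in a (B, n, n) stack.

    Illustrative only: classical Denman-Beavers iteration, applied to all
    blocks at once through NumPy's batched linear algebra routines.
    """
    _, n, _ = blocks.shape
    eye = np.eye(n)
    a = blocks + eps * eye  # hypothetical damping to keep blocks invertible

    # Per-block Frobenius norm, shape (B, 1, 1). Scaling puts every
    # eigenvalue in (0, 1]; the choice of scaling is an assumption here.
    scale = np.linalg.norm(a, axis=(1, 2), keepdims=True)

    y = a / scale                              # converges to (A/scale)^{1/2}
    z = np.broadcast_to(eye, a.shape).copy()   # converges to (A/scale)^{-1/2}
    for _ in range(num_iters):
        y_inv = np.linalg.inv(y)  # batched inverse over the leading axis
        z_inv = np.linalg.inv(z)
        y, z = 0.5 * (y + z_inv), 0.5 * (z + y_inv)

    # Undo the scaling: (A/scale)^{-1/2} = sqrt(scale) * A^{-1/2}.
    return z / np.sqrt(scale)

# Usage: build four random SPD blocks, stack them, and solve in one call.
rng = np.random.default_rng(0)
factors = [rng.standard_normal((8, 16)) for _ in range(4)]
blocks = np.stack([f @ f.T for f in factors])  # shape (4, 8, 8)
inv_sqrt = batched_inv_sqrt_db(blocks)
residual = inv_sqrt @ blocks @ inv_sqrt - np.eye(8)
print("max residual:", np.abs(residual).max())
```

The stacking step is what lets a single call operate on all blocks; on a GPU the same layout maps onto batched matrix multiplies. The per-block normalization before the loop is a simple instance of the matrix scaling that the abstract says strongly affects convergence.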
