DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers

February 2, 2026
Authors: Ionut-Vlad Modoranu, Philip Zmushko, Erik Schultheis, Mher Safaryan, Dan Alistarh
cs.AI

Abstract

Shampoo is one of the leading approximate second-order optimizers: a variant of it won the MLCommons AlgoPerf competition, and it has been shown to produce models with fewer activation outliers that are easier to compress. Yet applying Shampoo currently comes at the cost of a significant computational slowdown, due to its expensive internal operations. In this paper, we take a significant step toward addressing this shortcoming by proposing DASH (Distributed Accelerated SHampoo), a faster implementation of Distributed Shampoo based on two main new techniques: first, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization; second, we introduce the Newton-DB iteration and Chebyshev polynomial approximations as novel, faster approaches for computing the inverse matrix roots required by Shampoo. Alongside these algorithmic contributions, we provide a first in-depth analysis of how matrix scaling critically affects Shampoo convergence. On the practical side, our GPU-aware implementation achieves up to 4.83× faster optimizer steps compared to the well-optimized Distributed Shampoo, while Newton-DB attains the lowest validation perplexity per iteration among all tested methods. Our code is available at https://github.com/IST-DASLab/DASH.
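
The first technique is straightforward to illustrate: when the preconditioner blocks produced by layer-wise blocking share a common block size, they can be stacked along a leading batch dimension and processed by a single batched kernel instead of a Python-level loop. Below is a minimal PyTorch sketch of that idea under stated assumptions: the function name `batched_inverse_root` is hypothetical, and the eigendecomposition-based root shown here is just one generic way to compute inverse roots (DASH's own solvers are Newton-DB and Chebyshev approximations); this is not the DASH implementation.

```python
# Minimal sketch: stack same-sized SPD preconditioner blocks into a 3D
# tensor so one batched kernel replaces a loop of per-block factorizations.
# `batched_inverse_root` is a hypothetical name, not a DASH API.
import torch

def batched_inverse_root(blocks: torch.Tensor, root: int, eps: float = 1e-12) -> torch.Tensor:
    """Compute B^{-1/root} for a stack of SPD blocks of shape (N, b, b)."""
    # One batched symmetric eigendecomposition for all blocks at once;
    # torch.linalg.eigh broadcasts over leading batch dimensions.
    eigvals, eigvecs = torch.linalg.eigh(blocks)
    # Clamp tiny/negative eigenvalues for numerical safety, then take the
    # inverse root on the spectrum.
    inv_root = eigvals.clamp_min(eps).pow(-1.0 / root)
    # Reassemble V diag(lambda^{-1/root}) V^T, batched over N.
    return (eigvecs * inv_root.unsqueeze(-2)) @ eigvecs.transpose(-1, -2)

# Example: 64 blocks of size 128x128, inverse 4th root (the exponent
# Shampoo uses for each Kronecker factor of a 2D layer).
G = torch.randn(64, 128, 128)
blocks = G @ G.transpose(-1, -2) + 1e-3 * torch.eye(128)  # SPD stack
P = batched_inverse_root(blocks, root=4)
print(P.shape)  # torch.Size([64, 128, 128])
```

The gain comes from the single batched `torch.linalg.eigh` call over the (N, b, b) tensor, which keeps the GPU saturated where a loop over N small decompositions would not.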
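For the inverse roots themselves, the abstract names a Newton-DB iteration. Assuming "DB" refers to the classical Denman-Beavers coupled Newton iteration for matrix square roots (an assumption; the paper's exact scheme may differ), a minimal sketch for the inverse square root of a single SPD matrix looks as follows. Note the initial normalization: starting the iteration from a well-scaled matrix is essential for convergence, which echoes the abstract's observation that matrix scaling critically affects Shampoo.

```python
# Sketch of the classical Denman-Beavers coupled Newton iteration for the
# inverse matrix square root. Assumption: the paper's "Newton-DB" scheme is
# a relative of this iteration; this is not the DASH implementation.
import torch

def denman_beavers_inv_sqrt(A: torch.Tensor, iters: int = 20) -> torch.Tensor:
    """Return an approximation of A^{-1/2} for a single SPD matrix A."""
    n = A.shape[-1]
    I = torch.eye(n, dtype=A.dtype, device=A.device)
    # Normalize so the iteration starts well inside its convergence region.
    s = torch.linalg.matrix_norm(A, ord='fro')
    Y, Z = A / s, I.clone()
    for _ in range(iters):
        # Coupled updates: Y_k -> (A/s)^{1/2}, Z_k -> (A/s)^{-1/2}.
        Y_next = 0.5 * (Y + torch.linalg.inv(Z))
        Z = 0.5 * (Z + torch.linalg.inv(Y))
        Y = Y_next
    # Undo the scaling: (A/s)^{-1/2} = sqrt(s) * A^{-1/2}.
    return Z / s.sqrt()
```

Shampoo requires inverse 2k-th roots rather than only A^{-1/2}; square roots can be composed to reach them (for instance, (A^{1/2})^{-1/2} = A^{-1/4} for a 2D layer's factors), and because `torch.linalg.inv` broadcasts over leading batch dimensions, the Y/Z updates above extend to the stacked (N, b, b) blocks from the previous sketch, with the normalization applied per block.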