DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers
February 2, 2026
Authors: Ionut-Vlad Modoranu, Philip Zmushko, Erik Schultheis, Mher Safaryan, Dan Alistarh
cs.AI
Abstract
Shampoo is one of the leading approximate second-order optimizers: a variant of it won the MLCommons AlgoPerf competition, and it has been shown to produce models with fewer activation outliers that are easier to compress. Yet applying Shampoo currently comes at the cost of a significant computational slowdown, due to its expensive internal operations. In this paper, we take a significant step toward addressing this shortcoming by proposing DASH (Distributed Accelerated SHampoo), a faster implementation of Distributed Shampoo based on two main new techniques: first, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization; second, we introduce the Newton-DB iteration and Chebyshev polynomial approximations as novel and faster approaches for computing the inverse matrix roots required by Shampoo. Along with these algorithmic contributions, we provide a first in-depth analysis of how matrix scaling critically affects Shampoo's convergence. On the practical side, our GPU-aware implementation achieves up to 4.83× faster optimizer steps compared to the well-optimized Distributed Shampoo, while Newton-DB attains the lowest validation perplexity per iteration among all tested methods. Our code is available at https://github.com/IST-DASLab/DASH.
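To make the first technique concrete: stacking same-shaped preconditioner blocks along a leading dimension lets one batched factorization serve all blocks at once, instead of launching many small per-block kernels that leave the GPU underutilized. The following is a minimal PyTorch sketch of this idea, not the DASH implementation; the function name, block sizes, and the choice of a batched eigendecomposition are illustrative assumptions.

```python
import torch

def batched_inverse_root(blocks: torch.Tensor, root: int = 4,
                         eps: float = 1e-12) -> torch.Tensor:
    """Compute B_i^{-1/root} for a stack of SPD blocks of shape (k, n, n)."""
    # One batched eigendecomposition covers all k blocks in a single call.
    eigvals, eigvecs = torch.linalg.eigh(blocks)
    inv_root_vals = eigvals.clamp_min(eps).pow(-1.0 / root)
    # Recompose V diag(lambda^{-1/root}) V^T for every block in the batch.
    return eigvecs @ torch.diag_embed(inv_root_vals) @ eigvecs.transpose(-1, -2)

# Example: 64 Shampoo statistics blocks of size 128x128, stacked along dim 0.
grads = torch.randn(64, 128, 128)
stats = grads @ grads.transpose(-1, -2) + 1e-3 * torch.eye(128)
precond = batched_inverse_root(stats, root=4)
```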
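The abstract does not expand the "DB" in Newton-DB; assuming it refers to the classical Denman-Beavers coupled Newton iteration (an assumption on our part), the textbook form for the inverse square root looks as follows. The Frobenius-norm rescaling is one standard stabilization and connects to the paper's point that matrix scaling critically affects convergence; the paper's own scaling scheme may differ.

```python
import torch

def denman_beavers_inv_sqrt(a: torch.Tensor, num_iters: int = 20) -> torch.Tensor:
    """Approximate A^{-1/2} for an SPD matrix A via the Denman-Beavers iteration."""
    # Rescale so the spectrum lies in (0, 1]; an illustrative normalization.
    c = torch.linalg.matrix_norm(a)  # Frobenius norm upper-bounds the eigenvalues
    y = a / c
    z = torch.eye(a.shape[-1], dtype=a.dtype, device=a.device)
    for _ in range(num_iters):
        # Coupled Newton step: Y_k -> (A/c)^{1/2} and Z_k -> (A/c)^{-1/2}.
        y_next = 0.5 * (y + torch.linalg.inv(z))
        z = 0.5 * (z + torch.linalg.inv(y))
        y = y_next
    # (A/c)^{-1/2} = sqrt(c) * A^{-1/2}, so undo the scaling.
    return z / torch.sqrt(c)

# Sanity check: x @ a @ x should be close to the identity.
a = torch.randn(64, 64, dtype=torch.float64)
a = a @ a.T + 0.1 * torch.eye(64, dtype=torch.float64)
x = denman_beavers_inv_sqrt(a)
print(torch.linalg.matrix_norm(x @ a @ x - torch.eye(64, dtype=torch.float64)))
```

Shampoo's 2D case needs inverse fourth roots rather than inverse square roots; one composition compatible with this iteration is to run it a second time on the Y_k iterate, which converges to the square root of the input.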
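The Chebyshev route replaces factorizations with matrix multiplications: fit a polynomial to f(x) = x^{-1/4} on an interval covering the (suitably scaled) spectrum, then evaluate it at the matrix via Clenshaw's recurrence. The sketch below is a generic version of that recipe; the degree, interval, and target root are illustrative choices, not the paper's configuration.

```python
import math
import torch

def chebyshev_inverse_root(a: torch.Tensor, degree: int = 64,
                           lo: float = 1e-3, hi: float = 1.0) -> torch.Tensor:
    """Approximate A^{-1/4} for an SPD A whose spectrum lies in [lo, hi]."""
    m = degree + 1
    # Chebyshev nodes on [-1, 1], mapped affinely to [lo, hi].
    j = torch.arange(m, dtype=a.dtype, device=a.device)
    t = 0.5 * (hi - lo) * torch.cos(math.pi * (j + 0.5) / m) + 0.5 * (hi + lo)
    f = t.pow(-0.25)  # target function x^{-1/4} sampled at the nodes
    # Chebyshev coefficients via the discrete cosine formula.
    k = j.unsqueeze(1)
    c = (2.0 / m) * (f * torch.cos(math.pi * k * (j + 0.5) / m)).sum(dim=1)
    # Clenshaw recurrence, with the matrix argument mapped back to [-1, 1].
    eye = torch.eye(a.shape[-1], dtype=a.dtype, device=a.device)
    s = (2.0 * a - (hi + lo) * eye) / (hi - lo)
    b1 = torch.zeros_like(a)
    b2 = torch.zeros_like(a)
    for ck in c[1:].flip(0):
        b1, b2 = ck * eye + 2.0 * s @ b1 - b2, b1
    return 0.5 * c[0] * eye + s @ b1 - b2

# Usage: scale the statistics so their spectrum fits inside [lo, hi] first.
stats = torch.randn(128, 128)
stats = stats @ stats.T / 128 + torch.eye(128)
stats = stats / torch.linalg.matrix_norm(stats, ord=2)  # spectrum into (0, 1]
inv_root = chebyshev_inverse_root(stats)
```

The required degree grows with the conditioning of the scaled matrix, which is one more place where the scaling analysis highlighted in the abstract matters.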