4-bit Shampoo for Memory-Efficient Network Training
May 28, 2024
Authors: Sike Wang, Jia Li, Pan Zhou, Hua Huang
cs.AI
Abstract
Second-order optimizers, which maintain a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models trained by second-order optimizers. To address this, compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage. However, current approaches only pertain to first-order optimizers. In this paper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, which maintain performance similar to that of their 32-bit counterparts. We show that quantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is markedly better, both theoretically and experimentally, than quantizing the preconditioner itself. By rectifying the orthogonality of the quantized eigenvector matrix, we improve the approximation of the preconditioner's eigenvector matrix, which also benefits the computation of its inverse 4th root. Moreover, we find that linear square quantization slightly outperforms dynamic tree quantization when quantizing second-order optimizer states. Evaluation on various networks for image classification demonstrates that our 4-bit Shampoo achieves test accuracy comparable to that of its 32-bit counterpart while being more memory-efficient. The source code will be made available.
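
For concrete intuition, here is a minimal Python/NumPy sketch (not the authors' released code) of the two ingredients the abstract highlights: block-wise 4-bit quantization of the preconditioner's eigenvector matrix using a linear-square-style codebook, and an orthogonality-rectification step, implemented here as a Newton-Schulz (Björck) iteration as a stand-in for the paper's method. The function names, the exact codebook mapping, and the block size are all illustrative assumptions.

```python
import numpy as np

def linear_square_codebook(bits=4):
    # Signed codebook whose magnitudes are squares of equally spaced points
    # in (0, 1]; one plausible reading of "linear square quantization" --
    # the paper's exact mapping may differ.
    k = 2 ** (bits - 1)                      # 8 magnitude levels for 4 bits
    mags = (np.arange(1, k + 1) / k) ** 2
    return np.concatenate([-mags[::-1], mags])

def quantize(x, codebook, block=64):
    # Block-wise absmax scaling, then nearest-codeword lookup per entry.
    flat = np.ravel(x)
    pad = (-flat.size) % block
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True)
    scale[scale == 0] = 1.0
    idx = np.abs((blocks / scale)[..., None] - codebook).argmin(-1)
    return idx.astype(np.uint8), scale, x.shape, pad

def dequantize(idx, scale, shape, pad, codebook):
    flat = (codebook[idx] * scale).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

def rectify_orthogonality(U, iters=4):
    # Newton-Schulz / Bjorck iteration pushing U back toward the nearest
    # orthogonal matrix (a stand-in for the paper's rectification step).
    for _ in range(iters):
        U = 1.5 * U - 0.5 * U @ U.T @ U
    return U

# Toy round trip on the eigenvectors of a random SPD preconditioner.
rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
P = A @ A.T + 1e-3 * np.eye(256)              # SPD preconditioner
lam, U = np.linalg.eigh(P)                    # eigendecomposition of P
cb = linear_square_codebook()
U_hat = dequantize(*quantize(U, cb), cb)      # 4-bit store / restore
U_rect = rectify_orthogonality(U_hat)
ortho_err = lambda M: np.linalg.norm(M.T @ M - np.eye(M.shape[0]))
print(f"orthogonality error: {ortho_err(U_hat):.4f} -> {ortho_err(U_rect):.4f}")
# Inverse 4th root rebuilt from the rectified eigenvectors:
P_inv4 = (U_rect * lam ** -0.25) @ U_rect.T   # approximates P^{-1/4}
```

Rectification matters because the inverse 4th root is rebuilt as P^{-1/4} ≈ U diag(λ)^{-1/4} U^T, so any loss of orthogonality in the stored eigenvector matrix propagates directly into the preconditioning step.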