4-bit Shampoo for Memory-Efficient Network Training
May 28, 2024
Authors: Sike Wang, Jia Li, Pan Zhou, Hua Huang
cs.AI
Abstract
Second-order optimizers, maintaining a matrix termed a preconditioner, are
superior to first-order optimizers in both theory and practice. The states
forming the preconditioner and its inverse root restrict the maximum size of
models trained by second-order optimizers. To address this, compressing 32-bit
optimizer states to lower bitwidths has shown promise in reducing memory usage.
However, current approaches only pertain to first-order optimizers. In this
paper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit
Shampoo, maintaining performance similar to that of 32-bit ones. We show that
quantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is
remarkably better than quantizing the preconditioner itself both theoretically
and experimentally. By rectifying the orthogonality of the quantized
eigenvector matrix, we enhance the approximation of the preconditioner's
eigenvector matrix, which also benefits the computation of its inverse 4-th
root. In addition, we find that linear square quantization slightly outperforms
dynamic tree quantization when quantizing second-order optimizer states.
Evaluation on various networks for image classification demonstrates that our
4-bit Shampoo achieves comparable test accuracy to its 32-bit counterpart while
being more memory-efficient. The source code will be made available.
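
As a rough illustration of the quantization scheme named in the abstract, the following PyTorch sketch implements 4-bit linear square quantization, assuming a signed quadratic codebook over [-1, 1] with per-tensor absmax scaling. The paper's exact codebook construction and block layout are not given in the abstract, so those details are assumptions here.

```python
import torch

def linear_square_quantize(x: torch.Tensor, bits: int = 4):
    """4-bit linear square quantization sketch: evenly spaced codes c in
    [-1, 1] represent values sign(c) * c**2, giving finer resolution near
    zero. Per-tensor absmax scaling; block layout is an assumption."""
    codes = torch.linspace(-1.0, 1.0, 2 ** bits, device=x.device)
    codebook = torch.sign(codes) * codes ** 2          # signed quadratic levels
    scale = x.abs().max().clamp(min=1e-12)             # absmax scale
    # Nearest-codebook-entry assignment (memory-naive; fine for a demo)
    idx = (x.flatten() / scale).unsqueeze(1).sub(codebook).abs().argmin(dim=1)
    return idx.to(torch.uint8).reshape(x.shape), scale, codebook

def linear_square_dequantize(idx, scale, codebook):
    """Map stored 4-bit indices back to approximate 32-bit values."""
    return codebook[idx.long()] * scale
```

A production implementation would pack two 4-bit indices per byte and quantize in small blocks with per-block scales; the sketch leaves indices unpacked for clarity.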
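The orthogonality rectification step can likewise be sketched. One standard choice, assumed here because the abstract does not specify the procedure, is a Björck/Newton-Schulz iteration that pulls the dequantized eigenvector matrix back toward orthogonality before the inverse 4-th root is formed. The snippet continues the sketch above, reusing linear_square_quantize and linear_square_dequantize.

```python
import torch

def rectify_orthogonality(U: torch.Tensor, steps: int = 1) -> torch.Tensor:
    """Push a nearly orthogonal matrix back toward orthogonality with
    Bjorck/Newton-Schulz iterations: U <- U (3I - U^T U) / 2. This is one
    standard choice; the paper's exact rectification step is an assumption."""
    I = torch.eye(U.shape[-1], device=U.device, dtype=U.dtype)
    for _ in range(steps):
        U = U @ (3.0 * I - U.T @ U) / 2.0
    return U

# Toy usage: quantize the eigenvectors of an SPD preconditioner (not the
# preconditioner itself), rectify, then form the inverse 4-th root as
# P^{-1/4} = U diag(lambda^{-1/4}) U^T.
torch.manual_seed(0)
G = torch.randn(64, 64)
P = G @ G.T + 1e-3 * torch.eye(64)                    # SPD preconditioner
eigvals, U = torch.linalg.eigh(P)
idx, scale, codebook = linear_square_quantize(U)      # from the sketch above
U_hat = rectify_orthogonality(linear_square_dequantize(idx, scale, codebook))
P_inv_4th_root = U_hat @ torch.diag(eigvals.clamp(min=1e-12) ** -0.25) @ U_hat.T
```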