
Model-Preserving Adaptive Rounding

May 29, 2025
Authors: Albert Tseng, Zhaofeng Sun, Christopher De Sa
cs.AI

Abstract

The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily give a closer model. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion-parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by approximately 30% while achieving state-of-the-art performance on downstream tasks.
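To make the idea of a layerwise Hessian proxy concrete, the sketch below shows one generic way a Kronecker-factored approximation A ⊗ G could be estimated from calibration inputs and full-model KL gradients, and then used to score a candidate rounding of a linear layer. This is only an illustrative sketch under those assumptions, not the paper's YAQA algorithm; all names (`kron_hessian_sketch`, `proxy_loss`, the toy calibration data) are hypothetical.

```python
# Illustrative sketch only, not the YAQA implementation from the paper.
import torch

def kron_hessian_sketch(xs: torch.Tensor, gs: torch.Tensor):
    """Estimate Kronecker factors of a layerwise Hessian proxy.

    xs: (n, d_in)  layer inputs collected on calibration data
    gs: (n, d_out) gradients of a full-model KL objective with respect to
                   the layer outputs, collected on the same data
    Returns A (d_in, d_in) and G (d_out, d_out) so that the layerwise
    Hessian is approximated by A ⊗ G.
    """
    n = xs.shape[0]
    A = xs.T @ xs / n   # input second moments
    G = gs.T @ gs / n   # output-gradient second moments
    return A, G

def proxy_loss(W: torch.Tensor, Q: torch.Tensor,
               A: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """Quadratic proxy for the model-level damage of rounding W to Q:
    tr(G (W - Q) A (W - Q)^T). An adaptive rounding pass would try to pick
    Q, subject to the quantizer's grid, that keeps this quantity small."""
    dW = W - Q
    return torch.trace(G @ dW @ A @ dW.T)

# Toy usage with random tensors (shapes only; not a real calibration run).
d_in, d_out, n = 8, 4, 64
xs, gs = torch.randn(n, d_in), torch.randn(n, d_out)
A, G = kron_hessian_sketch(xs, gs)
W = torch.randn(d_out, d_in)
Q = torch.round(W * 4) / 4          # naive nearest rounding to a 0.25 grid
print(proxy_loss(W, Q, A, G).item())
```

Compared with the usual layer-local objective, which effectively uses only the activation factor A, the gradient factor G is what lets the quadratic form account for how a given layer's rounding error propagates to the full model's output distribution.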