QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

Abstract

One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by training directly over such representations, i.e., Quantization-Aware Training (QAT), is still open. For example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance this state of the art via a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at lower model size, while training models with weights and activations in 4 bits or fewer. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.
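To make ingredient (1) concrete, here is a minimal pure-Python sketch of Hadamard normalization followed by an MSE-fitted uniform quantizer. It is an illustration under stated assumptions, not the QuEST implementation: the abstract does not specify the grid, the normalization, or the fitting procedure, so the symmetric half-integer grid, the RMS normalization, the grid search over step sizes, and all function names (`fwht`, `quantize_value`, `fit_step`, `hadamard_quantize`) are hypothetical choices made for this example. See the linked repository for the authors' code.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    y = list(x)
    n = len(y)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def quantize_value(v, step, bits):
    """Round v to a symmetric grid with 2**bits levels at (k + 0.5) * step (assumed grid)."""
    half = 2 ** (bits - 1)
    k = math.floor(v / step)
    k = max(-half, min(half - 1, k))  # clip to the representable range
    return (k + 0.5) * step


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def fit_step(z, bits, candidates=200):
    """Pick the step size with the lowest MSE by grid search (assumed fitting procedure)."""
    max_abs = max(abs(v) for v in z) or 1.0
    step_max = 2.0 * max_abs / (2 ** bits)
    best_step, best_err = step_max, float("inf")
    for c in range(1, candidates + 1):
        step = step_max * c / candidates
        err = mse(z, [quantize_value(v, step, bits) for v in z])
        if err < best_err:
            best_step, best_err = step, err
    return best_step


def hadamard_quantize(x, bits):
    """Rotate with the Hadamard transform, normalize by RMS, quantize, and map back."""
    z = fwht(x)
    rms = math.sqrt(sum(v * v for v in z) / len(z)) or 1.0
    zn = [v / rms for v in z]
    step = fit_step(zn, bits)
    q = [quantize_value(v, step, bits) * rms for v in zn]
    return fwht(q)  # the orthonormal Hadamard transform is its own inverse


if __name__ == "__main__":
    random.seed(0)
    x = [random.gauss(0.0, 1.0) for _ in range(64)]
    for bits in (1, 2, 4):
        print(bits, "bits, reconstruction MSE:", mse(x, hadamard_quantize(x, bits)))
```

The rotation is used here because it spreads outliers across coordinates, which tends to make the values look closer to Gaussian before a single step size is fitted. The trust gradient estimator of ingredient (2) concerns the backward pass and is not sketched here.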
