Low-Bit Quantization Favors Undertrained LLMs: Scaling Laws for Quantized LLMs with 100T Training Tokens

Abstract

We reveal that low-bit quantization favors undertrained large language models (LLMs): models with larger sizes or fewer training tokens experience less quantization-induced degradation (QiD) when low-bit quantization is applied, whereas smaller models trained on extensive numbers of tokens suffer significant QiD. To gain deeper insight into this trend, we study over 1500 quantized LLM checkpoints of various sizes and at different training levels (undertrained or fully trained) in a controlled setting, and derive scaling laws that describe the relationship between QiD and factors such as the number of training tokens, model size, and bit width. With the derived scaling laws, we propose a novel perspective: QiD can be used to measure an LLM's training level and to determine the number of training tokens required to fully train LLMs of various sizes. Moreover, we use the scaling laws to predict the quantization performance of different-sized LLMs trained with 100 trillion tokens. Our projection shows that the low-bit quantization performance of future models, which are expected to be trained with over 100 trillion tokens, may NOT be desirable. This poses a potential challenge for low-bit quantization in the future and highlights the need for awareness of a model's training level when evaluating low-bit quantization research. To facilitate future research on this problem, we release all the 1500+ quantized checkpoints used in this work at https://huggingface.co/Xu-Ouyang.
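The qualitative trend described in the abstract can be illustrated with a small sketch. The abstract does not state the functional form of the scaling law or its fitted coefficients, so everything below is an assumption made for illustration: a simple power law in which QiD grows with training tokens and shrinks with model size and bit width. The function names `qid_power_law` and `tokens_at_qid`, and the default values of `k`, `alpha`, `beta`, and `gamma`, are hypothetical placeholders and are not the values reported in the paper.

```python
def qid_power_law(n_params, n_tokens, bits,
                  k=1.0, alpha=0.5, beta=0.5, gamma=1.0):
    """Illustrative QiD model (assumed form, placeholder coefficients).

    QiD = k * D**beta / (N**alpha * P**gamma)
    where N = n_params, D = n_tokens, P = bits.
    """
    if min(n_params, n_tokens, bits) <= 0:
        raise ValueError("n_params, n_tokens and bits must be positive")
    return k * n_tokens ** beta / (n_params ** alpha * bits ** gamma)


def tokens_at_qid(target_qid, n_params, bits,
                  k=1.0, alpha=0.5, beta=0.5, gamma=1.0):
    """Invert the assumed power law: tokens D at which QiD reaches target_qid."""
    if min(target_qid, n_params, bits) <= 0:
        raise ValueError("target_qid, n_params and bits must be positive")
    return (target_qid * n_params ** alpha * bits ** gamma / k) ** (1.0 / beta)


if __name__ == "__main__":
    # Compare a smaller and a larger model at the same token budget and bit width.
    for n in (1e9, 7e9, 70e9):
        print(f"N={n:.0e}  QiD={qid_power_law(n, 1e12, 4):.4f}")
```

Under this assumed form, inverting the law for a chosen QiD threshold gives a token count that could serve as a rough proxy for "fully trained" at a given model size and bit width, which mirrors the perspective the abstract proposes. Any real use would require the fitted form and coefficients from the paper itself.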
