Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models
June 18, 2024
Authors: Dongwon Jo, Taesu Kim, Yulhwa Kim, Jae-Joon Kim
cs.AI
Abstract
Binarization, which converts weight parameters to binary values, has emerged
as an effective strategy to reduce the size of large language models (LLMs).
However, typical binarization techniques significantly diminish linguistic
effectiveness of LLMs. To address this issue, we introduce a novel binarization
technique called Mixture of Scales (BinaryMoS). Unlike conventional methods,
BinaryMoS employs multiple scaling experts for binary weights, dynamically
merging these experts for each token to adaptively generate scaling factors.
This token-adaptive approach boosts the representational power of binarized
LLMs by enabling contextual adjustments to the values of binary weights.
Moreover, because this adaptive process only involves the scaling factors
rather than the entire weight matrix, BinaryMoS maintains compression
efficiency similar to traditional static binarization methods. Our experimental
results reveal that BinaryMoS surpasses conventional binarization techniques in
various natural language processing tasks and even outperforms 2-bit
quantization methods, all while maintaining similar model size to static
binarization techniques.
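As a minimal illustration of the token-adaptive scaling idea described above (a sketch, not the authors' implementation), the following PyTorch-style layer mixes several scaling experts per token via a lightweight router and applies the resulting scales to sign-binarized weights. The class name `BinaryMoSLinear`, the router design, and `num_experts` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryMoSLinear(nn.Module):
    """Sketch of a binarized linear layer with a mixture of scaling experts."""

    def __init__(self, in_features, out_features, num_experts=4):
        super().__init__()
        # Full-precision weights are kept here only to derive sign(W); after
        # binarization, only the binary weights, the scale experts, and the
        # small router would need to be stored.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # Each expert is one per-output-channel vector of scaling factors.
        self.scale_experts = nn.Parameter(torch.ones(num_experts, out_features))
        # Lightweight router mapping each token to mixing weights over experts.
        self.router = nn.Linear(in_features, num_experts)

    def forward(self, x):
        # x: (batch, seq_len, in_features)
        binary_w = torch.sign(self.weight)            # {-1, +1} binary weights
        gate = F.softmax(self.router(x), dim=-1)      # (B, T, num_experts)
        # Token-adaptive scales: convex combination of the scale experts.
        scales = gate @ self.scale_experts            # (B, T, out_features)
        # Binary matmul followed by per-token, per-channel rescaling.
        out = F.linear(x, binary_w)                   # (B, T, out_features)
        return out * scales
```

In this sketch, the only additions on top of the sign-binarized weight matrix are the `num_experts` scale vectors and the small router, which is consistent with the abstract's claim that the adaptive process touches only scaling factors and therefore keeps compression efficiency close to static binarization.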