

Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

June 18, 2024
Authors: Dongwon Jo, Taesu Kim, Yulhwa Kim, Jae-Joon Kim
cs.AI

Abstract

Binarization, which converts weight parameters to binary values, has emerged as an effective strategy to reduce the size of large language models (LLMs). However, typical binarization techniques significantly diminish the linguistic effectiveness of LLMs. To address this issue, we introduce a novel binarization technique called Mixture of Scales (BinaryMoS). Unlike conventional methods, BinaryMoS employs multiple scaling experts for binary weights, dynamically merging these experts for each token to adaptively generate scaling factors. This token-adaptive approach boosts the representational power of binarized LLMs by enabling contextual adjustments to the values of binary weights. Moreover, because this adaptive process involves only the scaling factors rather than the entire weight matrix, BinaryMoS maintains compression efficiency similar to that of traditional static binarization methods. Our experimental results reveal that BinaryMoS surpasses conventional binarization techniques on various natural language processing tasks and even outperforms 2-bit quantization methods, all while maintaining a model size similar to that of static binarization techniques.
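
To make the token-adaptive scaling concrete, the sketch below shows one possible mixture-of-scales linear layer in PyTorch. It is not the authors' implementation: the class name MixtureOfScalesLinear, the sign()-based binarization of latent weights, the expert count, and the softmax router over the token's hidden state are illustrative assumptions. Only the overall idea follows the abstract: a single binary weight matrix shared across tokens, with per-token scaling factors produced by mixing a small set of scaling experts.

# Minimal sketch (assumptions noted above), not the paper's reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfScalesLinear(nn.Module):
    def __init__(self, in_features, out_features, num_experts=4):
        super().__init__()
        # Full-precision latent weights; binarized to {-1, +1} in the forward pass.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # Each scaling expert holds one scaling factor per output channel.
        self.scale_experts = nn.Parameter(torch.ones(num_experts, out_features))
        # Lightweight router: maps a token's hidden state to mixing weights over experts.
        self.router = nn.Linear(in_features, num_experts)

    def forward(self, x):
        # x: (batch, seq_len, in_features)
        w_bin = torch.sign(self.weight)              # binary weight matrix
        gate = F.softmax(self.router(x), dim=-1)     # (batch, seq_len, num_experts)
        scales = gate @ self.scale_experts           # token-adaptive scales: (batch, seq_len, out_features)
        return F.linear(x, w_bin) * scales           # scale the binarized projection per token

# Usage example
layer = MixtureOfScalesLinear(in_features=64, out_features=128, num_experts=4)
y = layer(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 128])

Because only the scaling experts and the small router are stored on top of the binary weight matrix, the added memory is negligible relative to the weights themselves, which is why the abstract reports compression efficiency close to static binarization.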