Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Abstract

Binarization, which converts weight parameters to binary values, has emerged as an effective strategy for reducing the size of large language models (LLMs). However, typical binarization techniques significantly diminish the linguistic effectiveness of LLMs. To address this issue, we introduce a novel binarization technique called Mixture of Scales (BinaryMoS). Unlike conventional methods, BinaryMoS employs multiple scaling experts for binary weights and dynamically merges these experts for each token to adaptively generate scaling factors. This token-adaptive approach boosts the representational power of binarized LLMs by enabling contextual adjustments to the values of binary weights. Moreover, because this adaptive process involves only the scaling factors rather than the entire weight matrix, BinaryMoS maintains compression efficiency similar to that of traditional static binarization methods. Our experimental results reveal that BinaryMoS surpasses conventional binarization techniques on various natural language processing tasks and even outperforms 2-bit quantization methods, all while maintaining a model size similar to that of static binarization techniques.
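The token-adaptive scaling described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not specify how the scaling experts are merged, so everything concrete here is an assumption made for illustration: the softmax router, the separate input-side and output-side scale experts, the tensor shapes, and the names `binary_mos_linear`, `router`, `s_in`, and `s_out` are all hypothetical rather than the authors' implementation.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def binarize(w):
    """Sign-binarize a weight matrix to {-1, +1} (zeros map to +1)."""
    return np.where(w >= 0, 1.0, -1.0)


def binary_mos_linear(x, w_bin, router, s_in, s_out):
    """Hypothetical token-adaptive binary linear layer (illustrative only).

    x:      (tokens, d_in)      input activations
    w_bin:  (d_out, d_in)       binary weights in {-1, +1}
    router: (d_in, n_experts)   assumed linear router producing gate logits
    s_in:   (n_experts, d_in)   assumed input-side scaling experts
    s_out:  (n_experts, d_out)  assumed output-side scaling experts
    """
    gate = softmax(x @ router)   # (tokens, n_experts), one mixture per token
    scale_in = gate @ s_in       # (tokens, d_in), merged per-token scales
    scale_out = gate @ s_out     # (tokens, d_out), merged per-token scales
    # Only the scales vary per token; the binary weight matrix is shared.
    return ((x * scale_in) @ w_bin.T) * scale_out


rng = np.random.default_rng(0)
tokens, d_in, d_out, n_experts = 3, 8, 4, 2
x = rng.standard_normal((tokens, d_in))
w_bin = binarize(rng.standard_normal((d_out, d_in)))
router = rng.standard_normal((d_in, n_experts))
s_in = np.abs(rng.standard_normal((n_experts, d_in)))
s_out = np.abs(rng.standard_normal((n_experts, d_out)))

y = binary_mos_linear(x, w_bin, router, s_in, s_out)

# With a single expert the gate is always 1, so the layer reduces to
# static binarization with one fixed set of scaling factors.
y_one = binary_mos_linear(x, w_bin, router[:, :1], s_in[:1], s_out[:1])
y_static = ((x * s_in[0]) @ w_bin.T) * s_out[0]
```

In this sketch the only parameters added on top of the binary matrix are the router and the scale experts, `n_experts * (2 * d_in + d_out)` values in total, which grows linearly with the layer dimensions while the binary weights grow with `d_in * d_out`. That is consistent with the abstract's claim that adapting only the scaling factors keeps the model size close to that of static binarization.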
