BitDelta: Your Fine-Tune May Only Be Worth One Bit

Abstract

Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given that pre-training demands more computation, it is intuitive to assume that fine-tuning adds less new information to the model and is therefore more compressible. We explore this assumption by decomposing the weights of a fine-tuned model into its pre-trained component and an additional delta. We introduce a simple method, BitDelta, which quantizes this delta down to 1 bit without compromising performance. This finding not only highlights the potential redundancy of the information added during fine-tuning, but also has significant implications for multi-tenant serving and multi-tenant storage of fine-tuned models. By enabling a single high-precision base model to be served alongside multiple 1-bit deltas, BitDelta reduces GPU memory requirements by more than 10x, which can also translate into lower generation latency in multi-tenant settings. We validate BitDelta through experiments across the Llama-2 and Mistral model families, and on models of up to 70B parameters, showing minimal performance degradation in all tested settings.
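The decomposition described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the abstract does not say how the 1-bit delta is scaled, so the sketch assumes a single per-matrix scale equal to the mean absolute value of the delta (the scale that minimizes squared reconstruction error for a fixed sign pattern). The function names compress_delta, reconstruct, and memory_gb are hypothetical, and the memory estimate is a back-of-the-envelope calculation that ignores the per-matrix scales and any layers left uncompressed.

```python
import numpy as np


def compress_delta(w_base: np.ndarray, w_fine: np.ndarray):
    """Split a fine-tuned weight matrix into base + delta, keeping 1 bit per delta entry."""
    delta = w_fine - w_base
    # One sign bit per entry: True stands for +1, False for -1 (zeros map to +1).
    sign_bits = delta >= 0
    # Assumption: one scale per matrix, set to the mean absolute delta.
    scale = float(np.abs(delta).mean())
    # Pack 8 sign bits into each byte for storage.
    packed = np.packbits(sign_bits, axis=None)
    return packed, scale, delta.shape


def reconstruct(w_base: np.ndarray, packed: np.ndarray, scale: float, shape):
    """Rebuild an approximate fine-tuned matrix: w_base + scale * sign(delta)."""
    n = int(np.prod(shape))
    bits = np.unpackbits(packed, count=n).reshape(shape)
    signs = bits.astype(w_base.dtype) * 2 - 1  # map {0, 1} to {-1, +1}
    return w_base + scale * signs


def memory_gb(n_params: float, n_models: int, base_bits: int = 16, delta_bits: int = 1):
    """Rough weight memory (GB) for n_models full copies vs. one base plus n_models deltas."""
    naive = n_models * n_params * base_bits / 8 / 1e9
    shared = (n_params * base_bits + n_models * n_params * delta_bits) / 8 / 1e9
    return naive, shared


# Toy example: a 4x8 matrix and a small synthetic "fine-tuning" perturbation.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((4, 8)).astype(np.float32)
w_fine = w_base + 0.01 * rng.standard_normal((4, 8)).astype(np.float32)

packed, scale, shape = compress_delta(w_base, w_fine)
w_hat = reconstruct(w_base, packed, scale, shape)
print(packed.nbytes)  # 32 sign bits fit in 4 bytes

naive, shared = memory_gb(7e9, 32)  # hypothetical: 32 fine-tunes of a 7B-parameter model
print(naive, shared)
```

In this toy estimate the saving from sharing one base model grows with the number of fine-tunes being served, since each additional model costs only its 1-bit delta rather than a full 16-bit copy.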
