BitDelta: Your Fine-Tune May Only Be Worth One Bit
February 15, 2024
Authors: James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai
cs.AI
Abstract
Large Language Models (LLMs) are typically trained in two phases:
pre-training on large internet-scale datasets, and fine-tuning for downstream
tasks. Given the higher computational demand of pre-training, it's intuitive to
assume that fine-tuning adds less new information to the model, and is thus
more compressible. We explore this assumption by decomposing the weights of
fine-tuned models into their pre-trained components and an additional delta. We
introduce a simple method, BitDelta, which successfully quantizes this delta
down to 1 bit without compromising performance. This interesting finding not
only highlights the potential redundancy of information added during
fine-tuning, but also has significant implications for the multi-tenant serving
and multi-tenant storage of fine-tuned models. By enabling the use of a single
high-precision base model accompanied by multiple 1-bit deltas, BitDelta
dramatically reduces GPU memory requirements by more than 10x, which can also
be translated to enhanced generation latency in multi-tenant settings. We
validate BitDelta through experiments across Llama-2 and Mistral model
families, and on models up to 70B parameters, showcasing minimal performance
degradation over all tested settings.
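The core idea in the abstract can be sketched in a few lines: subtract the base weights from the fine-tuned weights, keep only the sign of the resulting delta, and pair it with a single high-precision scale per weight matrix. The sketch below is a minimal illustration, assuming the scale is initialized as the mean absolute delta (the paper further calibrates scales, e.g. by distillation, which is omitted here); the function names are hypothetical.

```python
import numpy as np

def bitdelta_quantize(w_base, w_ft):
    """Compress the fine-tuning delta to 1 bit per weight.

    Sketch only: store sign(delta) (1 bit/weight) plus one
    full-precision scale per matrix, taken here as mean(|delta|).
    """
    delta = w_ft - w_base
    scale = np.abs(delta).mean()   # one scalar per weight matrix
    signs = np.sign(delta)         # 1-bit mask, materialized as +/-1
    return signs, scale

def bitdelta_reconstruct(w_base, signs, scale):
    # Approximate fine-tuned weights: base + scale * sign(delta)
    return w_base + scale * signs

# Toy usage: a small fine-tune perturbation on random base weights
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4)).astype(np.float32)
ft = base + 0.01 * rng.normal(size=(4, 4)).astype(np.float32)

signs, scale = bitdelta_quantize(base, ft)
approx = bitdelta_reconstruct(base, signs, scale)
```

In a multi-tenant setting, the memory saving comes from storing the full-precision base model once and only the 1-bit `signs` (plus one scalar) per fine-tuned variant.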