BitDelta: Your Fine-Tune May Only Be Worth One Bit
February 15, 2024
Authors: James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai
cs.AI
Abstract
Large Language Models (LLMs) are typically trained in two phases:
pre-training on large internet-scale datasets, and fine-tuning for downstream
tasks. Given the higher computational demand of pre-training, it is intuitive
to assume that fine-tuning adds less new information to the model, and is thus
more compressible. We explore this assumption by decomposing the weights of
fine-tuned models into their pre-trained components and an additional delta. We
introduce a simple method, BitDelta, which successfully quantizes this delta
down to 1 bit without compromising performance. This interesting finding not
only highlights the potential redundancy of information added during
fine-tuning, but also has significant implications for the multi-tenant serving
and multi-tenant storage of fine-tuned models. By enabling the use of a single
high-precision base model accompanied by multiple 1-bit deltas, BitDelta
dramatically reduces GPU memory requirements by more than 10x, which also
translates into improved generation latency in multi-tenant settings. We
validate BitDelta through experiments across Llama-2 and Mistral model
families, and on models up to 70B parameters, showcasing minimal performance
degradation across all tested settings.
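The core idea described in the abstract can be sketched in a few lines: compute the delta between fine-tuned and base weights, keep only its sign (1 bit per parameter), and attach a single per-matrix scale. The sketch below uses the mean absolute delta as the scale, which minimizes the L2 error of a symmetric 1-bit approximation; the function names are illustrative, and the paper's additional scale calibration step is omitted here.

```python
import numpy as np

def bitdelta_quantize(w_base: np.ndarray, w_ft: np.ndarray):
    """Approximate the fine-tune delta as alpha * sign(delta).

    Returns the sign matrix (storable at 1 bit per entry) and a single
    per-matrix scale alpha = mean(|delta|), the L2-optimal symmetric scale.
    """
    delta = w_ft - w_base
    sign = np.sign(delta)           # values in {-1, 0, +1}
    alpha = np.abs(delta).mean()    # one scalar per weight matrix
    return sign, alpha

def reconstruct(w_base: np.ndarray, sign: np.ndarray, alpha: float):
    """Recover an approximate fine-tuned weight from base + 1-bit delta."""
    return w_base + alpha * sign
```

Because the base weights are shared, serving many fine-tunes only requires storing one sign matrix and one scalar per weight matrix per fine-tune, which is the source of the >10x memory reduction claimed above.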