BitNet Distillation
October 15, 2025
Authors: Xun Wu, Shaohan Huang, Wenhui Wang, Ting Song, Li Dong, Yan Xia, Furu Wei
cs.AI
Abstract
In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue, i.e., the performance gap between fine-tuned full-precision and 1.58-bit LLMs on specific tasks that grows with model size. Experimental results show that BitDistill achieves performance comparable to its full-precision counterparts across model sizes, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.
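
To make two of the ingredients above concrete, the sketch below illustrates (a) absmean ternary quantization to {-1, 0, 1} in the style of BitNet b1.58 and (b) a MiniLM-style attention-distillation loss. It is a minimal sketch assuming standard PyTorch, not the authors' implementation: the function names, the per-tensor absmean scaling, and the exact KL formulation are assumptions based on the cited BitNet and MiniLM papers rather than the released BitDistill code.

```python
# Minimal sketch, NOT the BitDistill code: (a) absmean ternary quantization
# (BitNet b1.58 style) and (b) a MiniLM-style attention-distillation loss.
# Names and exact loss form are illustrative assumptions.
import torch
import torch.nn.functional as F


def ternary_quantize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Map full-precision weights to 1.58-bit ternary values {-1, 0, 1}."""
    scale = w.abs().mean().clamp(min=eps)      # per-tensor absmean scale
    w_q = (w / scale).round().clamp(-1, 1)     # ternary codes in {-1, 0, 1}
    # Straight-through estimator so gradients flow to the latent fp weights.
    return w + (w_q * scale - w).detach()


def attn_distill_loss(t_attn: torch.Tensor, s_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence between teacher and student attention distributions.

    Both tensors have shape (batch, heads, seq_len, seq_len), with each row
    already softmax-normalized over the last dimension.
    """
    return F.kl_div(s_attn.clamp_min(1e-9).log(), t_attn, reduction="batchmean")
```

In a full fine-tuning loop, the quantized projections would replace the student's linear layers while the teacher stays at full precision, and the attention loss would be added to the task loss; which layers are quantized and how the losses are weighted are design choices not specified by the abstract.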