BitNet Distillation
October 15, 2025
Authors: Xun Wu, Shaohan Huang, Wenhui Wang, Ting Song, Li Dong, Yan Xia, Furu Wei
cs.AI
Abstract
In this paper, we present BitNet Distillation (BitDistill), a lightweight pipeline that fine-tunes off-the-shelf full-precision LLMs (e.g., Qwen) into 1.58-bit precision (i.e., ternary weights {-1, 0, 1}) for specific downstream tasks, achieving strong task-specific performance with minimal computational cost. Specifically, BitDistill incorporates three key techniques: the SubLN module, as introduced in BitNet; multi-head attention distillation, based on MiniLM; and continual pre-training, which serves as a crucial warm-up step to mitigate the scalability issue, i.e., the performance gap between fine-tuned full-precision and 1.58-bit LLMs on specific tasks that grows with model size. Experimental results show that BitDistill achieves performance comparable to its full-precision counterparts across model sizes, while enabling up to 10x memory savings and 2.65x faster inference on CPUs. Code is available at https://github.com/microsoft/BitNet.
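
To make two of the ingredients above concrete, the sketch below illustrates (a) absmean ternary quantization to {-1, 0, 1} in the style of BitNet b1.58 and (b) a MiniLM-style attention-distillation loss. It is a minimal sketch assuming standard PyTorch, not the authors' implementation: the function names, the per-tensor absmean scaling, and the exact KL formulation are assumptions based on the cited BitNet and MiniLM papers rather than the released BitDistill code.

```python
# Minimal sketch, NOT the BitDistill code: (a) absmean ternary quantization
# (BitNet b1.58 style) and (b) a MiniLM-style attention-distillation loss.
# Names and exact loss form are illustrative assumptions.
import torch
import torch.nn.functional as F


def ternary_quantize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Map full-precision weights to 1.58-bit ternary values {-1, 0, 1}."""
    scale = w.abs().mean().clamp(min=eps)      # per-tensor absmean scale
    w_q = (w / scale).round().clamp(-1, 1)     # ternary codes in {-1, 0, 1}
    # Straight-through estimator so gradients flow to the latent fp weights.
    return w + (w_q * scale - w).detach()


def attn_distill_loss(t_attn: torch.Tensor, s_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence between teacher and student attention distributions.

    Both tensors have shape (batch, heads, seq_len, seq_len), with each row
    already softmax-normalized over the last dimension.
    """
    return F.kl_div(s_attn.clamp_min(1e-9).log(), t_attn, reduction="batchmean")
```

In a full fine-tuning loop, the quantized projections would replace the student's linear layers while the teacher stays at full precision, and the attention loss would be added to the task loss; which layers are quantized and how the losses are weighted are design choices not specified by the abstract.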