Distillation Scaling Laws
February 12, 2025
Authors: Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb
cs.AI
Abstract
We provide a distillation scaling law that estimates distilled model
performance based on a compute budget and its allocation between the student
and teacher. Our findings reduce the risks associated with using distillation
at scale; compute allocation for both the teacher and student models can now be
done to maximize student performance. We provide compute-optimal distillation
recipes for when 1) a teacher exists, or 2) a teacher needs training. If many
students are to be distilled, or a teacher already exists, distillation
outperforms supervised pretraining until a compute level which grows
predictably with student size. If one student is to be distilled and a teacher
also needs training, supervised learning should be done instead. Additionally,
we provide insights from our large-scale study of distillation, which
increase our understanding of distillation and inform experimental design.
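
To make the compute-allocation idea concrete, the following is a minimal Python sketch, not the paper's fitted law: it grid-searches the fraction of a fixed FLOP budget spent on teacher pretraining versus student distillation. The loss forms (Chinchilla-style supervised loss, a teacher-loss floor on the distilled student), all constants, the 6 * parameters * tokens FLOP approximation, and the helper names supervised_loss, distilled_student_loss, and best_split are illustrative assumptions introduced here for exposition.

import numpy as np

FLOPS_PER_TOKEN_FACTOR = 6.0  # standard approximation: training FLOPs ~ 6 * params * tokens


def supervised_loss(n_params, n_tokens, e=1.7, a=400.0, b=410.0, alpha=0.34, beta=0.28):
    # Chinchilla-style parametric loss; constants are illustrative placeholders.
    return e + a / n_params**alpha + b / n_tokens**beta


def distilled_student_loss(n_student, d_distill, teacher_loss):
    # Crude stand-in for a distillation scaling law: treat distillation like
    # supervised training on the distillation tokens, floored at the teacher's
    # own loss. The actual fitted law in the paper is richer than this.
    return max(teacher_loss, supervised_loss(n_student, d_distill))


def best_split(total_flops, n_teacher, n_student, fracs=None):
    # Grid-search the fraction of total compute spent on teacher pretraining,
    # spending the remainder on distilling a fixed-size student.
    if fracs is None:
        fracs = np.linspace(0.05, 0.95, 19)
    best = None
    for frac in fracs:
        d_teacher = frac * total_flops / (FLOPS_PER_TOKEN_FACTOR * n_teacher)
        d_distill = (1.0 - frac) * total_flops / (FLOPS_PER_TOKEN_FACTOR * n_student)
        l_teacher = supervised_loss(n_teacher, d_teacher)
        l_student = distilled_student_loss(n_student, d_distill, l_teacher)
        if best is None or l_student < best[1]:
            best = (frac, l_student)
    return best


if __name__ == "__main__":
    frac, loss = best_split(total_flops=1e21, n_teacher=3e9, n_student=1e9)
    print(f"teacher compute fraction ~ {frac:.2f}, predicted student loss ~ {loss:.3f}")

This sketch corresponds to the abstract's case 2, where the teacher still needs training and its pretraining compute counts against the total budget; in case 1, where a teacher already exists, its cost is sunk and all compute would go to distilling the student.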