On Teacher Hacking in Language Model Distillation

February 4, 2025
Authors: Daniil Tiapkin, Daniele Calandriello, Johan Ferret, Sarah Perrin, Nino Vieillard, Alexandre Ramé, Mathieu Blondel
cs.AI

Abstract

Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model. Such a phenomenon is in line with Goodhart's law and can lead to degraded performance on the true objective. In this paper, we investigate whether a similar phenomenon, which we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust and efficient LMs.
