LookAhead Tuning: Safer Language Models via Partial Answer Previews
March 24, 2025
Authors: Kangwei Liu, Mengru Wang, Yujie Luo, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen
cs.AI
Abstract
Fine-tuning enables large language models (LLMs) to adapt to specific
domains, but often undermines their previously established safety alignment. To
mitigate the degradation of model safety during fine-tuning, we introduce
LookAhead Tuning, which comprises two simple, low-resource, and effective
data-driven methods that modify training data by previewing partial answer
prefixes. Both methods aim to preserve the model's inherent safety mechanisms
by minimizing perturbations to initial token distributions. Comprehensive
experiments demonstrate that LookAhead Tuning effectively maintains model
safety without sacrificing robust performance on downstream tasks. Our findings
position LookAhead Tuning as a reliable and efficient solution for the safe and
effective adaptation of LLMs. Code is released at
https://github.com/zjunlp/LookAheadTuning.
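The abstract describes the method only at a high level: training data is rewritten so that a partial prefix of the answer is previewed alongside the instruction, keeping the model's initial answer-token distribution close to its pre-fine-tuning behavior. The sketch below is a hypothetical illustration of that data-transformation idea, not the authors' released implementation (see the repository above); the prompt template, the preview_len parameter, and the field names are assumptions.

```python
# Minimal sketch of "previewing a partial answer prefix" in the training data.
# This is NOT the authors' code; template, preview_len, and field names are assumed.

from typing import Dict, List


def add_answer_preview(example: Dict[str, str], preview_len: int = 6) -> Dict[str, str]:
    """Prepend the first `preview_len` whitespace-delimited tokens of the answer
    to the instruction, so fine-tuning perturbs the model's distribution over the
    initial answer tokens less."""
    preview = " ".join(example["answer"].split()[:preview_len])
    new_instruction = (
        f"{example['instruction']}\n"
        f"(The answer begins with: \"{preview}\")"
    )
    return {"instruction": new_instruction, "answer": example["answer"]}


def build_dataset(raw: List[Dict[str, str]], preview_len: int = 6) -> List[Dict[str, str]]:
    """Apply the answer-preview transformation to every instruction-answer pair."""
    return [add_answer_preview(ex, preview_len) for ex in raw]


if __name__ == "__main__":
    data = [{
        "instruction": "Summarize the quarterly report.",
        "answer": "The report finds that sales rose 12% in Q3 while costs stayed flat.",
    }]
    print(build_dataset(data)[0]["instruction"])
```

The transformed pairs would then be fed to an ordinary supervised fine-tuning loop; only the data preparation step changes, which is consistent with the abstract's claim that the methods are simple, low-resource, and data-driven.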