LookAhead Tuning: Safer Language Models via Partial Answer Previews
March 24, 2025
Authors: Kangwei Liu, Mengru Wang, Yujie Luo, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen
cs.AI
Abstract
Fine-tuning enables large language models (LLMs) to adapt to specific
domains, but often undermines their previously established safety alignment. To
mitigate the degradation of model safety during fine-tuning, we introduce
LookAhead Tuning, which comprises two simple, low-resource, and effective
data-driven methods that modify training data by previewing partial answer
prefixes. Both methods aim to preserve the model's inherent safety mechanisms
by minimizing perturbations to initial token distributions. Comprehensive
experiments demonstrate that LookAhead Tuning effectively maintains model
safety without sacrificing robust performance on downstream tasks. Our findings
position LookAhead Tuning as a reliable and efficient solution for the safe and
effective adaptation of LLMs. Code is released at
https://github.com/zjunlp/LookAheadTuning.
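The abstract describes the method only at a high level: training data is rewritten so that a partial prefix of the answer is previewed alongside the instruction, keeping the model's initial answer-token distribution close to its pre-fine-tuning behavior. The sketch below is a hypothetical illustration of that data-transformation idea, not the authors' released implementation (see the repository above); the prompt template, the preview_len parameter, and the field names are assumptions.

```python
# Minimal sketch of "previewing a partial answer prefix" in the training data.
# This is NOT the authors' code; template, preview_len, and field names are assumed.

from typing import Dict, List


def add_answer_preview(example: Dict[str, str], preview_len: int = 6) -> Dict[str, str]:
    """Prepend the first `preview_len` whitespace-delimited tokens of the answer
    to the instruction, so fine-tuning perturbs the model's distribution over the
    initial answer tokens less."""
    preview = " ".join(example["answer"].split()[:preview_len])
    new_instruction = (
        f"{example['instruction']}\n"
        f"(The answer begins with: \"{preview}\")"
    )
    return {"instruction": new_instruction, "answer": example["answer"]}


def build_dataset(raw: List[Dict[str, str]], preview_len: int = 6) -> List[Dict[str, str]]:
    """Apply the answer-preview transformation to every instruction-answer pair."""
    return [add_answer_preview(ex, preview_len) for ex in raw]


if __name__ == "__main__":
    data = [{
        "instruction": "Summarize the quarterly report.",
        "answer": "The report finds that sales rose 12% in Q3 while costs stayed flat.",
    }]
    print(build_dataset(data)[0]["instruction"])
```

The transformed pairs would then be fed to an ordinary supervised fine-tuning loop; only the data preparation step changes, which is consistent with the abstract's claim that the methods are simple, low-resource, and data-driven.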