LookAhead Tuning: Safer Language Models via Partial Answer Previews
March 24, 2025
Authors: Kangwei Liu, Mengru Wang, Yujie Luo, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen
cs.AI
Abstract
Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often undermines their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes. Both methods aim to preserve the model's inherent safety mechanisms by minimizing perturbations to initial token distributions. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs. Code is released at https://github.com/zjunlp/LookAheadTuning.
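The abstract describes the method only at a high level. The Python sketch below illustrates one plausible reading of "previewing partial answer prefixes": augmenting each training example so that the prompt exposes the first few answer tokens, so that fine-tuning perturbs the model's initial token distribution less. The function name, prompt template, and preview length are illustrative assumptions, not taken from the paper or the released LookAheadTuning code.

```python
# Minimal sketch (illustrative only, not the authors' released implementation):
# augment a supervised fine-tuning example so the prompt previews the first few
# tokens of the answer. All names and the prompt wording are assumptions.

def add_partial_answer_preview(example: dict, num_preview_tokens: int = 5) -> dict:
    """Return a copy of an {'instruction', 'answer'} example whose instruction
    embeds a short preview of the answer's opening tokens."""
    preview = " ".join(example["answer"].split()[:num_preview_tokens])
    previewed_instruction = (
        f"{example['instruction']}\n"
        f"(The answer begins with: \"{preview} ...\")"
    )
    return {"instruction": previewed_instruction, "answer": example["answer"]}


if __name__ == "__main__":
    sample = {
        "instruction": "Summarize the main contribution of the paper.",
        "answer": "LookAhead Tuning preserves safety alignment during fine-tuning "
                  "by previewing partial answer prefixes in the training data.",
    }
    print(add_partial_answer_preview(sample))
```

Under this reading, the target answer itself is left unchanged; only the prompt side of each training pair is modified, which is consistent with the abstract's description of the approach as simple, low-resource, and data-driven.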