LookAhead Tuning: Safer Language Models via Partial Answer Previews
March 24, 2025
Authors: Kangwei Liu, Mengru Wang, Yujie Luo, Lin Yuan, Mengshu Sun, Ningyu Zhang, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen
cs.AI
Abstract
Fine-tuning enables large language models (LLMs) to adapt to specific domains, but often undermines their previously established safety alignment. To mitigate the degradation of model safety during fine-tuning, we introduce LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes. Both methods aim to preserve the model's inherent safety mechanisms by minimizing perturbations to initial token distributions. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs. Code is released at https://github.com/zjunlp/LookAheadTuning.
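The abstract describes the method only at a high level. The Python sketch below illustrates one plausible reading of "previewing partial answer prefixes": augmenting each training example so that the prompt exposes the first few answer tokens, so that fine-tuning perturbs the model's initial token distribution less. The function name, prompt template, and preview length are illustrative assumptions, not taken from the paper or the released LookAheadTuning code.

```python
# Minimal sketch (illustrative only, not the authors' released implementation):
# augment a supervised fine-tuning example so the prompt previews the first few
# tokens of the answer. All names and the prompt wording are assumptions.

def add_partial_answer_preview(example: dict, num_preview_tokens: int = 5) -> dict:
    """Return a copy of an {'instruction', 'answer'} example whose instruction
    embeds a short preview of the answer's opening tokens."""
    preview = " ".join(example["answer"].split()[:num_preview_tokens])
    previewed_instruction = (
        f"{example['instruction']}\n"
        f"(The answer begins with: \"{preview} ...\")"
    )
    return {"instruction": previewed_instruction, "answer": example["answer"]}


if __name__ == "__main__":
    sample = {
        "instruction": "Summarize the main contribution of the paper.",
        "answer": "LookAhead Tuning preserves safety alignment during fine-tuning "
                  "by previewing partial answer prefixes in the training data.",
    }
    print(add_partial_answer_preview(sample))
```

Under this reading, the target answer itself is left unchanged; only the prompt side of each training pair is modified, which is consistent with the abstract's description of the approach as simple, low-resource, and data-driven.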