LookAhead Tuning: Safer Language Models via Partial Answer Previews

Abstract

Fine-tuning enables large language models (LLMs) to adapt to specific domains, but it often undermines their previously established safety alignment. To mitigate this degradation of model safety during fine-tuning, we introduce LookAhead Tuning, which comprises two simple, low-resource, and effective data-driven methods that modify training data by previewing partial answer prefixes. Both methods aim to preserve the model's inherent safety mechanisms by minimizing perturbations to initial token distributions. Comprehensive experiments demonstrate that LookAhead Tuning effectively maintains model safety without sacrificing robust performance on downstream tasks. Our findings position LookAhead Tuning as a reliable and efficient solution for the safe and effective adaptation of LLMs. Code is released at https://github.com/zjunlp/LookAheadTuning.

