Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

Abstract

Ensuring that large language models (LLMs) are safely aligned with human values is critical as they become integral to applications such as translation and question answering. Current alignment methods struggle with dynamic user intentions and complex objectives, leaving models vulnerable to generating harmful content. We propose Safety Arithmetic, a training-free framework that enhances LLM safety across three scenarios: base models, supervised fine-tuned (SFT) models, and edited models. Safety Arithmetic consists of Harm Direction Removal, which avoids harmful content, and Safety Alignment, which promotes safe responses. Additionally, we present NoIntentEdit, a dataset highlighting edit instances that could compromise model safety if used unintentionally. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility, outperforming existing methods in ensuring safe content generation.
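The abstract names the two stages but does not spell out how they operate. As a rough illustration only, the sketch below assumes that Harm Direction Removal subtracts a scaled "harm vector" (taken here as the difference between the weights of a hypothetical harm-tuned model and the base weights) from the target model's parameters, and that Safety Alignment adds a scaled "safety vector" to a hidden activation. The function names, the scaling factors `lam` and `alpha`, and the toy numbers are all hypothetical and are not taken from the paper.

```python
# Toy sketch of parameter and activation steering.
# Assumption: both stages are simple vector arithmetic; the paper's actual
# procedure may differ. Plain Python lists stand in for weight tensors.

def harm_direction_removal(theta_target, theta_base, theta_harmful, lam=1.0):
    """Subtract a scaled harm vector from the target parameters (assumed form)."""
    # Hypothetical harm vector: what harmful fine-tuning changed relative to base.
    harm_vector = [h - b for h, b in zip(theta_harmful, theta_base)]
    return [t - lam * v for t, v in zip(theta_target, harm_vector)]


def steer_activation(hidden, safety_vector, alpha=1.0):
    """Add a scaled safety vector to a hidden activation (assumed form)."""
    return [h + alpha * s for h, s in zip(hidden, safety_vector)]


# Toy parameters (illustrative values only).
theta_base = [0.0, 1.0, 2.0]
theta_harmful = [0.5, 1.0, 1.0]
theta_target = [0.25, 1.5, 2.0]
theta_safe = harm_direction_removal(theta_target, theta_base, theta_harmful, lam=0.5)

# Toy activation steering at inference time.
hidden = [1.0, -1.0]
safety_vector = [0.0, 2.0]
steered = steer_activation(hidden, safety_vector, alpha=0.5)

print(theta_safe)
print(steered)
```

Both steps only edit existing weights or activations, which is consistent with the abstract's description of the framework as training-free; how the directions are actually estimated is outside what the abstract states.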
