Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
June 17, 2024
Authors: Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria
cs.AI
Abstract
Ensuring the safe alignment of large language models (LLMs) with human values
is critical as they become integral to applications like translation and
question answering. Current alignment methods struggle with dynamic user
intentions and complex objectives, making models vulnerable to generating
harmful content. We propose Safety Arithmetic, a training-free framework
enhancing LLM safety across different scenarios: Base models, Supervised
fine-tuned models (SFT), and Edited models. Safety Arithmetic involves Harm
Direction Removal to avoid harmful content and Safety Alignment to promote safe
responses. Additionally, we present NoIntentEdit, a dataset highlighting edit
instances that could compromise model safety if used unintentionally. Our
experiments show that Safety Arithmetic significantly improves safety measures,
reduces over-safety, and maintains model utility, outperforming existing
methods in ensuring safe content generation.
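The two operations named in the abstract, Harm Direction Removal (steering parameters) and Safety Alignment (steering activations), can be pictured as simple vector edits applied at test time. The sketch below is only a minimal illustration of that idea using toy PyTorch tensors: the function names, the `alpha`/`beta` scaling factors, and the random stand-in weights and directions are assumptions for illustration, not the paper's actual procedure.

```python
# Minimal, hypothetical sketch of the two ideas named in the abstract,
# using plain tensors in place of real model weights and activations.
import torch

def remove_harm_direction(weights, harm_vector, alpha=0.5):
    """Harm Direction Removal (sketch): subtract a scaled 'harm' task
    vector from each parameter, in the spirit of task arithmetic.
    `alpha` is an assumed scaling factor."""
    return {name: w - alpha * harm_vector[name] for name, w in weights.items()}

def steer_activation(hidden, safety_direction, beta=1.0):
    """Safety Alignment (sketch): nudge a hidden activation toward a
    'safe' direction at inference time. `beta` is an assumed strength."""
    return hidden + beta * safety_direction

# Toy usage with random tensors standing in for real weights/activations.
weights = {"layer.0.weight": torch.randn(4, 4)}
harm_vector = {"layer.0.weight": torch.randn(4, 4)}
safe_weights = remove_harm_direction(weights, harm_vector, alpha=0.3)

hidden = torch.randn(1, 4)
safety_direction = torch.randn(1, 4)
steered = steer_activation(hidden, safety_direction, beta=0.8)
```

Because both edits are applied to an already trained model at test time, this kind of procedure requires no additional training, which matches the abstract's description of Safety Arithmetic as a training-free framework.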