SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
June 18, 2024
Authors: Somnath Banerjee, Soham Tripathy, Sayan Layek, Shanu Kumar, Animesh Mukherjee, Rima Hazra
cs.AI
Abstract
Safety-aligned language models often exhibit fragile and imbalanced safety
mechanisms, increasing the likelihood of generating unsafe content. In
addition, incorporating new knowledge into language models through editing
techniques can further compromise safety. To address these issues, we propose
SafeInfer, a context-adaptive, decoding-time safety alignment strategy for
generating safe responses to user queries. SafeInfer comprises two phases: the
safety amplification phase, which employs safe demonstration examples to adjust
the model's hidden states and increase the likelihood of safer outputs, and the
safety-guided decoding phase, which influences token selection based on
safety-optimized distributions, ensuring the generated content complies with
ethical guidelines. Further, we present HarmEval, a novel benchmark for
extensive safety evaluations, designed to address potential misuse scenarios in
accordance with the policies of leading AI tech giants.
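
The two phases described above can be pictured with a toy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the additive shift of the hidden state, the linear blending of logits, and every name used here (mean_vector, amplify_hidden_state, safety_guided_distribution, alpha, beta) are hypothetical choices made for this example.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors (e.g. hidden states of safe demonstrations)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def amplify_hidden_state(hidden, safety_vector, alpha=1.0):
    """Assumed form of safety amplification: shift the hidden state along a safety direction."""
    return [h + alpha * s for h, s in zip(hidden, safety_vector)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def safety_guided_distribution(base_logits, safety_logits, beta=0.5):
    """Assumed form of safety-guided decoding: blend base logits with safety-oriented logits."""
    mixed = [(1 - beta) * b + beta * s for b, s in zip(base_logits, safety_logits)]
    return softmax(mixed)


# Phase 1 (hypothetical): derive a safety direction from demonstration states and apply it.
demo_states = [[1.0, -1.0], [3.0, -3.0]]
safety_vector = mean_vector(demo_states)
steered = amplify_hidden_state([0.0, 1.0], safety_vector, alpha=0.5)

# Phase 2 (hypothetical): pick the next token from the blended distribution.
probs = safety_guided_distribution([2.0, 0.0], [0.0, 2.0], beta=0.5)
next_token = max(range(len(probs)), key=probs.__getitem__)
```

In this sketch, alpha controls how strongly the hidden state is pushed toward the demonstrations and beta controls how much the safety-oriented distribution overrides the base model at each decoding step; both would need tuning in any real system.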