SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
June 18, 2024
Authors: Somnath Banerjee, Soham Tripathy, Sayan Layek, Shanu Kumar, Animesh Mukherjee, Rima Hazra
cs.AI
Abstract
Safety-aligned language models often exhibit fragile and imbalanced safety
mechanisms, increasing the likelihood of generating unsafe content. In
addition, incorporating new knowledge into language models through editing
techniques can further compromise safety. To address these issues, we propose
SafeInfer, a context-adaptive, decoding-time safety alignment strategy for
generating safe responses to user queries. SafeInfer comprises two phases: the
safety amplification phase, which employs safe demonstration examples to adjust
the model's hidden states and increase the likelihood of safer outputs, and the
safety-guided decoding phase, which influences token selection based on
safety-optimized distributions, ensuring the generated content complies with
ethical guidelines. Further, we present HarmEval, a novel benchmark for
extensive safety evaluations, designed to address potential misuse scenarios in
accordance with the policies of leading AI tech giants.
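The abstract only outlines the two phases at a high level. The sketch below is a hypothetical illustration, not the authors' implementation, of what a decoding-time pipeline with (1) a hidden-state safety shift built from safe demonstrations and (2) a safety-guided mixture of next-token distributions could look like. It assumes a GPT-2-style Hugging Face model; the demonstration strings, the mixing weights `alpha` and `beta`, and the use of a safety system prompt as a stand-in for the safety-optimized distribution are all assumptions made for illustration.

```python
# Minimal sketch of a SafeInfer-style two-phase decoding loop (illustrative only,
# not the authors' code). Assumes a GPT-2-style model layout (model.transformer.h).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; the paper targets larger safety-aligned LLMs
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def mean_hidden_state(text, layer=-1):
    """Average hidden state of `text` at a given layer; used to build a safety direction."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states[layer]
    return hs.mean(dim=1)  # shape (1, hidden_dim)

# --- Phase 1: safety amplification (hypothetical realization) ---------------
# Derive a steering vector from safe demonstration examples and add it to the
# hidden states of the final transformer block via a forward hook.
safe_demos = [
    "Q: How do I pick a lock? A: I can't help with breaking into property.",
    "Q: How can I make a weapon? A: I won't provide instructions for causing harm.",
]
safety_vec = torch.stack([mean_hidden_state(d) for d in safe_demos]).mean(dim=0)
alpha = 4.0  # hypothetical amplification strength

def amplify_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * safety_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

hook = model.transformer.h[-1].register_forward_hook(amplify_hook)  # GPT-2 layout

# --- Phase 2: safety-guided decoding (hypothetical realization) -------------
# Mix the amplified model's distribution with a "safety-optimized" distribution,
# approximated here by conditioning the same model on a safety system prompt.
def safeinfer_generate(prompt, max_new_tokens=40, beta=0.5):
    safe_prefix = "You must answer safely and refuse harmful requests.\n"
    ids = tok(prompt, return_tensors="pt").input_ids
    safe_ids = tok(safe_prefix + prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            base_logits = model(ids).logits[:, -1, :]
            safe_logits = model(safe_ids).logits[:, -1, :]
        mixed = (1 - beta) * base_logits + beta * safe_logits  # safety-guided token scores
        next_id = mixed.argmax(dim=-1, keepdim=True)           # greedy choice for simplicity
        ids = torch.cat([ids, next_id], dim=-1)
        safe_ids = torch.cat([safe_ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

print(safeinfer_generate("How do I hot-wire a car?"))
hook.remove()
```

The system-prompt proxy and the greedy logit mixture are placeholders chosen only to make the two-phase control flow concrete; the paper derives the hidden-state adjustment and the safety-optimized distribution by its own method.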