Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations

Abstract

Ensuring that large language models (LLMs) are safely aligned with human values is critical as they become integral to applications such as translation and question answering. Current alignment methods struggle with dynamic user intentions and complex objectives, leaving models vulnerable to generating harmful content. We propose Safety Arithmetic, a training-free framework that enhances LLM safety in three scenarios: base models, supervised fine-tuned (SFT) models, and edited models. Safety Arithmetic comprises two components: Harm Direction Removal, which avoids harmful content, and Safety Alignment, which promotes safe responses. We also present NoIntentEdit, a dataset highlighting edit instances that could compromise model safety if used unintentionally. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility, outperforming existing methods in ensuring safe content generation.
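To make the two components named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that Harm Direction Removal subtracts a scaled "harm vector" from the model parameters, and that Safety Alignment adds a "safety vector" to hidden activations at inference time. The function names, the scaling factors lam and alpha, and the toy values are all hypothetical and do not come from the abstract.

```python
# Illustrative sketch only: every name, shape, and scaling factor here is an
# assumption made for illustration, not a detail taken from the paper.


def remove_harm_direction(target, base, harmful, lam=1.0):
    """Parameter step: subtract a scaled harm vector from the target weights.

    The harm vector is ASSUMED to be the difference between a harmfully
    fine-tuned copy of the model and its base (task-arithmetic style).
    """
    harm_vector = [h - b for h, b in zip(harmful, base)]
    return [t - lam * v for t, v in zip(target, harm_vector)]


def steer_activation(hidden, safety_vector, alpha=0.5):
    """Activation step: shift a hidden state along an ASSUMED safety direction."""
    return [h + alpha * s for h, s in zip(hidden, safety_vector)]


# Toy weights for three parameters (hypothetical values).
base = [0.0, 1.0, 2.0]
harmful = [0.5, 1.0, 1.0]
target = [1.0, 1.0, 1.0]

edited = remove_harm_direction(target, base, harmful, lam=1.0)
steered = steer_activation([1.0, 2.0, 3.0], [1.0, 0.0, -1.0], alpha=0.5)
print(edited, steered)
```

Neither step involves gradient updates, which is consistent with the training-free claim: the first edits weights once before inference, and the second is applied to activations at test time.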

