Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing

Abstract

Large Language Models (LLMs) have demonstrated great potential as generalist assistants, showcasing powerful task understanding and problem-solving capabilities. To deploy LLMs as AI assistants, it is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts. Current methods for detoxification or jailbreak prevention usually involve Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), which require fine-tuning billions of parameters through gradient descent at substantial computational cost. Furthermore, models modified through SFT and RLHF may deviate from the pretrained models, potentially degrading foundational LLM capabilities. In this paper, we observe that, surprisingly, directly editing a small subset of parameters can effectively modulate specific behaviors of LLMs, such as detoxification and resistance to jailbreaking. Specifically, for a behavior that we aim to avoid, we employ a linear classifier, which we term the behavior probe, to classify binary behavior labels within the hidden state space of the LLM. Using this probe, we introduce an algorithm to identify a critical subset of LLM parameters that significantly influence the targeted behavior. We then directly edit these selected parameters by shifting them towards the behavior probe. This direct parameter editing method requires only inference-level computational resources. Experiments demonstrate that, on the representative detoxification task, our approach reduces toxicity by up to 90.0% on the RealToxicityPrompts dataset and 49.2% on ToxiGen, while maintaining the LLM's general capabilities in areas such as common sense, question answering, and mathematics. Our code is available at https://github.com/lucywang720/model-surgery.
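
The three steps described above (fit a linear behavior probe on hidden states, select a subset of parameters, shift them towards the probe) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`train_probe`, `select_rows`, `edit_rows`), the use of cosine similarity as the selection criterion, the convention that label 1 marks the desired behavior, and the step size `alpha` are all assumptions made for illustration, and the toy weight matrix stands in for whichever LLM parameters the actual method edits.

```python
import numpy as np


def train_probe(hidden, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe on hidden states.

    Assumption: label 1 marks the desired behavior, so the learned
    weight vector points towards it in hidden-state space.
    """
    n, d = hidden.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))  # predicted P(label = 1)
        err = p - labels                             # gradient of the log-loss
        w -= lr * hidden.T @ err / n
        b -= lr * err.mean()
    return w, b


def cosine_to_probe(weight, probe):
    """Cosine similarity between each row of `weight` and the probe."""
    norms = np.linalg.norm(weight, axis=1) * np.linalg.norm(probe)
    return weight @ probe / np.maximum(norms, 1e-12)


def select_rows(weight, probe, k):
    """Hypothetical selection rule: the k rows least aligned with the probe."""
    return np.argsort(cosine_to_probe(weight, probe))[:k]


def edit_rows(weight, probe, rows, alpha):
    """Shift only the selected rows towards the unit-norm probe direction."""
    edited = weight.copy()
    edited[rows] += alpha * probe / np.linalg.norm(probe)
    return edited


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 16
    direction = np.zeros(dim)
    direction[0] = 1.0
    # Synthetic "hidden states": two classes separated along one axis.
    labels = (rng.random(400) < 0.5).astype(float)
    hidden = rng.normal(size=(400, dim)) + 2.0 * (2 * labels - 1)[:, None] * direction

    probe, bias = train_probe(hidden, labels)
    weight = rng.normal(size=(64, dim))  # toy stand-in for an LLM weight matrix
    rows = select_rows(weight, probe, k=8)
    edited = edit_rows(weight, probe, rows, alpha=0.5)
    print("mean cosine before:", cosine_to_probe(weight, probe)[rows].mean())
    print("mean cosine after: ", cosine_to_probe(edited, probe)[rows].mean())
```

Nothing in the edit step involves gradients with respect to the model weights: only the small probe is trained, and the edit itself is a single addition to the selected rows, which is consistent with the claim that the method needs only inference-level compute.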
