Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
July 11, 2024
Authors: Huanqian Wang, Yang Yue, Rui Lu, Jingxin Shi, Andrew Zhao, Shenzhi Wang, Shiji Song, Gao Huang
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated great potential as generalist
assistants, showcasing powerful task understanding and problem-solving
capabilities. To deploy LLMs as AI assistants, it is crucial that these models
exhibit desirable behavioral traits, such as non-toxicity and resilience
against jailbreak attempts. Current methods for detoxification or jailbreak
prevention usually involve Supervised Fine-Tuning (SFT) or Reinforcement
Learning from Human Feedback (RLHF), both of which require fine-tuning billions
of parameters via gradient descent at substantial computational cost.
Furthermore, models modified through SFT and RLHF may deviate from the
pretrained models, potentially leading to a degradation in foundational LLM
capabilities. In this paper, we observe that, surprisingly, directly editing a
small subset of parameters can effectively modulate specific behaviors of LLMs,
such as detoxification and resistance to jailbreaking. Specifically, for a
behavior that we aim to avoid, we employ a linear classifier, which we term the
behavior probe, to classify binary behavior labels within the hidden state
space of the LLM. Using this probe, we introduce an algorithm to identify a
critical subset of LLM parameters that significantly influence this targeted
behavior. Then we directly edit these selected parameters by shifting them
towards the behavior probe. Such a direct parameter editing method necessitates
only inference-level computational resources. Experiments demonstrate that in
the representative detoxification task, our approach achieves reductions of up
to 90.0% in toxicity on the RealToxicityPrompts dataset and 49.2% on ToxiGen,
while maintaining the LLM's general capabilities in areas such as common sense,
question answering, and mathematics. Our code is available at
https://github.com/lucywang720/model-surgery.
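To make the pipeline in the abstract concrete, here is a minimal, hypothetical sketch of the three steps: fitting a linear behavior probe on hidden states, selecting a critical subset of parameters by their alignment with the probe, and shifting those parameters toward the probe direction. Everything here is an illustrative assumption rather than the paper's method: the function names, the top-k alignment rule for choosing parameters, the step size `alpha`, and the rows-as-hidden-vectors layout are placeholders; see the linked repository for the authors' actual implementation.

```python
# Hypothetical sketch of the abstract's three steps in PyTorch.
# Names, selection rule, and hyperparameters are illustrative, not the
# authors' implementation.
import torch


def train_behavior_probe(hidden_states, labels, epochs=200, lr=1e-2):
    """Fit a linear classifier (the 'behavior probe') on LLM hidden states.

    hidden_states: (N, d) tensor of hidden states collected from the LLM.
    labels:        (N,) tensor of binary behavior labels (1 = unwanted).
    Returns the probe direction w, a vector in hidden-state space.
    """
    d = hidden_states.shape[1]
    w = torch.zeros(d, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(hidden_states @ w + b, labels.float())
        loss.backward()
        opt.step()
    return w.detach()


def edit_parameters(value_vecs, probe, k=64, alpha=0.1):
    """Shift the k parameter vectors most aligned with the probe.

    value_vecs: (m, d) matrix whose rows are vectors written into the
    hidden (residual) space, e.g. the value vectors of an FFN output
    projection; the rows-as-hidden-vectors layout is an assumption and
    depends on the framework's Linear convention.
    """
    direction = probe / probe.norm()
    scores = (value_vecs @ direction).abs()          # |alignment| per row
    idx = scores.topk(min(k, value_vecs.shape[0])).indices
    with torch.no_grad():
        # Shift toward the probe, as the abstract describes; the sign may
        # need flipping depending on which class the probe's positive
        # direction encodes.
        value_vecs[idx] += alpha * direction
    return value_vecs


# Toy usage with random stand-in data (d = hidden size):
N, d = 512, 128
hs = torch.randn(N, d)
y = torch.randint(0, 2, (N,))
probe = train_behavior_probe(hs, y)
W = torch.randn(256, d)          # stand-in for one layer's weight matrix
W = edit_parameters(W, probe)
```

Consistent with the abstract's claim, a procedure like this touches only a small, targeted set of weights and needs no gradient updates to the full model, so its cost is on the order of running inference plus fitting a single linear probe.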