Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
July 11, 2024
作者: Huanqian Wang, Yang Yue, Rui Lu, Jingxin Shi, Andrew Zhao, Shenzhi Wang, Shiji Song, Gao Huang
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated great potential as generalist
assistants, showcasing powerful task understanding and problem-solving
capabilities. To deploy LLMs as AI assistants, it is crucial that these models
exhibit desirable behavioral traits, such as non-toxicity and resilience
against jailbreak attempts. Current methods for detoxification or jailbreak
prevention usually involve Supervised Fine-Tuning (SFT) or Reinforcement
Learning from Human Feedback (RLHF), both of which require fine-tuning billions
of parameters through gradient descent at substantial computational cost.
Furthermore, models modified through SFT and RLHF may deviate from the
pretrained models, potentially leading to a degradation in foundational LLM
capabilities. In this paper, we observe that, surprisingly, directly editing a
small subset of parameters can effectively modulate specific behaviors of LLMs,
such as detoxification and resistance to jailbreaking. Specifically, for a
behavior that we aim to avoid, we employ a linear classifier, which we term the
behavior probe, to classify binary behavior labels within the hidden state
space of the LLM. Using this probe, we introduce an algorithm to identify a
critical subset of LLM parameters that significantly influence this targeted
behavior. Then we directly edit these selected parameters by shifting them
towards the behavior probe. Such a direct parameter editing method necessitates
only inference-level computational resources. Experiments demonstrate that in
the representative detoxification task, our approach achieves reductions of up
to 90.0% in toxicity on the RealToxicityPrompts dataset and 49.2% on ToxiGen,
while maintaining the LLM's general capabilities in areas such as common sense,
question answering, and mathematics. Our code is available at
https://github.com/lucywang720/model-surgery.
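The abstract's pipeline (fit a linear behavior probe on hidden states, score parameters by their alignment with the probe, then shift the selected parameters along the probe direction) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the hidden states, weight matrix, top-k selection rule, and edit strength `alpha` below are all simplified stand-ins, and the sign of the shift in practice depends on how the behavior labels are defined.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden-state dimension

# Stand-ins for LLM hidden states with binary behavior labels
# (e.g., 1 = toxic continuation, 0 = non-toxic).
hidden = rng.normal(size=(200, d))
labels = (hidden[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)

# Step 1: fit the linear "behavior probe" by logistic regression
# (plain gradient descent on the cross-entropy loss).
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-hidden @ w))
    w -= 0.1 * hidden.T @ (p - labels) / len(labels)
probe = w / np.linalg.norm(w)  # unit vector for the behavior direction

# Step 2: score the rows of a toy weight matrix (standing in for one
# LLM projection layer) by alignment with the probe; keep the top-k
# as the "critical subset" of parameters.
W = rng.normal(size=(d, d))
signed = W @ probe
critical = np.argsort(np.abs(signed))[-3:]

# Step 3: edit only the selected rows by shifting them along the
# probe direction, as the abstract describes; no gradient step needed.
alpha = 0.5
before = W[critical] @ probe
W[critical] += alpha * probe
after = W[critical] @ probe

# Each edited row's projection onto the probe moves by exactly alpha.
print(np.allclose(after - before, alpha))
```

The key point the sketch captures is that the edit is a direct parameter update computed from a single probe vector, so it costs no more than inference plus one pass of probe fitting, in contrast to SFT/RLHF over all parameters.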