Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing

Abstract

Large Language Models (LLMs) have demonstrated great potential as generalist assistants, showing strong task understanding and problem-solving capabilities. To deploy LLMs as AI assistants, it is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts. Current methods for detoxification or jailbreak prevention usually involve Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), which require fine-tuning billions of parameters through gradient descent at substantial computational cost. Furthermore, models modified through SFT and RLHF may deviate from the pretrained models, potentially degrading foundational LLM capabilities. In this paper, we observe that, surprisingly, directly editing a small subset of parameters can effectively modulate specific LLM behaviors, such as detoxification and resistance to jailbreaking. Specifically, for a behavior that we aim to avoid, we employ a linear classifier, which we term the behavior probe, to classify binary behavior labels within the hidden state space of the LLM. Using this probe, we introduce an algorithm to identify a critical subset of LLM parameters that significantly influence the targeted behavior. We then directly edit these selected parameters by shifting them towards the behavior probe. This direct parameter editing method requires only inference-level computational resources. Experiments demonstrate that, on the representative detoxification task, our approach reduces toxicity by up to 90.0% on the RealToxicityPrompts dataset and 49.2% on ToxiGen, while maintaining the LLM's general capabilities in areas such as common sense, question answering, and mathematics. Our code is available at https://github.com/lucywang720/model-surgery.
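The three steps named in the abstract (fit a linear behavior probe on hidden states, pick the parameters most related to it, shift those parameters along the probe direction) can be illustrated with a toy NumPy sketch. This is not the authors' implementation, which lives in the linked repository. The abstract does not specify the selection criterion, which weights are edited, the step size, or the sign convention, so the cosine-similarity top-k rule, the parameter `alpha`, the convention that label 1 marks the desired behavior, the synthetic data, and the function names `train_probe`, `select_rows`, and `edit_rows` are all assumptions made here for illustration.

```python
import numpy as np

def train_probe(hidden, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe (w, b) on hidden states by gradient descent."""
    n, d = hidden.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(hidden @ w + b)))
        grad = probs - labels                 # gradient of the log-loss w.r.t. the logits
        w -= lr * (hidden.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def select_rows(weight, probe, k):
    """Assumed criterion: the k rows with the highest cosine similarity to the probe."""
    unit = probe / np.linalg.norm(probe)
    cos = (weight @ unit) / np.linalg.norm(weight, axis=1)
    return np.argsort(cos)[-k:]

def edit_rows(weight, probe, rows, alpha):
    """Return a copy of weight with the selected rows shifted by alpha along the unit probe."""
    unit = probe / np.linalg.norm(probe)
    edited = weight.copy()
    edited[rows] += alpha * unit
    return edited

# Synthetic stand-in for LLM hidden states: two classes separated along one direction.
rng = np.random.default_rng(0)
d, n = 16, 400
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=n).astype(float)   # assumed: 1 = desired behavior
hidden = rng.normal(size=(n, d)) + 2.0 * np.outer(2.0 * labels - 1.0, true_dir)

probe, bias = train_probe(hidden, labels)
accuracy = (((hidden @ probe + bias) > 0) == (labels == 1)).mean()

# Hypothetical weight matrix whose rows live in the hidden-state space.
weight = rng.normal(size=(64, d))
rows = select_rows(weight, probe, k=4)
edited = edit_rows(weight, probe, rows, alpha=0.5)
print(f"probe accuracy: {accuracy:.2f}, edited rows: {sorted(rows.tolist())}")
```

Only forward-style computations on the weights are needed for the selection and edit steps, which is consistent with the abstract's claim of inference-level cost. In this sketch the only gradient descent is the fit of the small probe itself.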
