Be Kind, Rewrite: Defending Against LLM Data Poisoning Attacks via Benign Projection Through Rewriting

Abstract

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), in which training samples are poisoned with trigger-based harmful content. Moreover, existing defenses have proven ineffective when tested extensively across BA patterns. To better combat BAs, we explore LLM rewriting as a proactive defense against data poisoning. First, we show theoretically that when LLM rewriting draws on open-book benign samples, an approach we term open-book benign rewriting (OBBR), the probability that a rewritten output is benign is strictly greater than under closed-book rewriting. OBBR thus neutralizes harmful content by projecting training samples onto the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR improves safety performance by an average of 51% over state-of-the-art BA defenses and by 25.7% over closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and can defend against non-trigger-based data poisoning attacks.
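
The distinction between closed-book and open-book rewriting can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not give the prompt wording, the number of benign samples, or how they are selected, and the function names `build_rewrite_prompt` and `rewrite_dataset` are hypothetical. The rewriting LLM is passed in as a plain callable so that no particular model API is implied.

```python
import random
from typing import Callable, List, Sequence


def build_rewrite_prompt(sample: str, benign_examples: Sequence[str] = ()) -> str:
    """Build a rewriting prompt (hypothetical wording).

    With no benign examples this is a closed-book prompt; with benign
    examples supplied it becomes an open-book prompt.
    """
    lines: List[str] = []
    if benign_examples:
        lines.append("Here are examples of benign text:")
        lines.extend(f"- {example}" for example in benign_examples)
        lines.append("")
    lines.append("Rewrite the following text so that it is benign:")
    lines.append(sample)
    return "\n".join(lines)


def rewrite_dataset(
    samples: Sequence[str],
    benign_pool: Sequence[str],
    rewriter: Callable[[str], str],
    k: int = 2,
    seed: int = 0,
) -> List[str]:
    """Rewrite every training sample before fine-tuning.

    `rewriter` stands in for an LLM call (assumption). `k` benign
    examples are drawn uniformly at random per sample, which is an
    illustrative choice, not the paper's stated selection rule.
    """
    rng = random.Random(seed)
    pool = list(benign_pool)
    rewritten = []
    for sample in samples:
        shots = rng.sample(pool, min(k, len(pool)))
        rewritten.append(rewriter(build_rewrite_prompt(sample, shots)))
    return rewritten
```

In this sketch the only difference between the two modes is whether benign examples are placed in the rewriter's context; every sample, poisoned or not, passes through the same rewriting step before fine-tuning.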