An Embarrassingly Simple Defense Against LLM Abliteration Attacks

Abstract

Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that pairs each harmful prompt with a full response that justifies the refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset and evaluate the resulting systems on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates after abliteration, dropping by at most 10%, whereas the refusal rates of baseline models drop by 70-80%. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.
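
To make the attack concrete, the following is a minimal sketch of suppressing a single latent direction, run on synthetic activations rather than a real model. The abstract does not say how the direction is estimated; the difference-of-means estimator, the function names `refusal_direction` and `ablate`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations.

    Assumption: difference of means is one common way to estimate such a
    direction; the abstract does not specify the estimator.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(acts, direction):
    """Remove the component of every activation along `direction`."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-ins for hidden states (hypothetical, not model outputs):
# "harmful" activations are shifted along one axis of a 16-dim space.
rng = np.random.default_rng(0)
dim = 16
true_dir = np.zeros(dim)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, dim))
harmful = rng.normal(size=(200, dim)) + 4.0 * true_dir

r = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r)

# After ablation, no activation has any component left along r.
print(np.abs(ablated @ r).max())
```

The sketch shows why a refusal signal concentrated in one direction is fragile: a single projection removes it entirely. The proposed defense targets this by changing how refusals are generated.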
