Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

Abstract

Traditional white-box methods for creating adversarial perturbations against large language models (LLMs) typically rely only on gradient computation from the targeted model, ignoring the internal mechanisms responsible for attack success or failure. Conversely, interpretability studies that analyze these internal mechanisms lack practical applications beyond runtime interventions. We bridge this gap by introducing a novel white-box approach that leverages mechanistic interpretability techniques to craft practical adversarial inputs. Specifically, we first identify acceptance subspaces (sets of feature vectors that do not trigger the model's refusal mechanisms), then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks. This targeted approach significantly reduces computation cost, achieving attack success rates of 80-95% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5 within minutes or even seconds, whereas existing techniques often fail or require hours of computation. We believe this approach opens a new direction for both attack research and defense development. Furthermore, it showcases a practical application of mechanistic interpretability in a setting where other methods are less efficient, which highlights its utility. The code and generated datasets are available at https://github.com/Sckathach/subspace-rerouting.
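The two-step idea in the abstract (characterize a target region in activation space, then optimize an input by gradient descent so that its activation moves toward that region) can be illustrated with a purely synthetic toy. The sketch below is not the authors' implementation and involves no language model: the single tanh layer `W`, the two Gaussian clusters, and the use of a mean activation as a stand-in for an "acceptance subspace" are all assumptions made for illustration only. The actual method is in the repository linked above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16

# Hypothetical stand-in for "the model up to some layer": one fixed tanh layer.
W = rng.normal(scale=0.3, size=(d_hid, d_in))

def hidden(x):
    """Toy activation for a single input vector x."""
    return np.tanh(W @ x)

# Synthetic inputs: two Gaussian clusters playing the roles of
# "refused" and "accepted" inputs (an assumption, not real data).
refused = rng.normal(loc=-1.0, size=(200, d_in))
accepted = rng.normal(loc=+1.0, size=(200, d_in))

# Simplification: represent the target region by one point, the mean
# activation of the "accepted" cluster.
target = np.tanh(accepted @ W.T).mean(axis=0)

def loss_and_grad(x):
    """Squared distance from hidden(x) to the target, and its gradient in x."""
    h = hidden(x)
    diff = h - target
    loss = float(diff @ diff)
    # Chain rule through tanh: d tanh(z)/dz = 1 - tanh(z)^2.
    grad = W.T @ (2.0 * diff * (1.0 - h ** 2))
    return loss, grad

def reroute(x0, lr=0.02, steps=1000):
    """Plain gradient descent on the input vector."""
    x = x0.copy()
    for _ in range(steps):
        _, g = loss_and_grad(x)
        x -= lr * g
    return x

x_start = refused[0]
x_end = reroute(x_start)
loss_start, _ = loss_and_grad(x_start)
loss_end, _ = loss_and_grad(x_end)
print(f"distance^2 to target: {loss_start:.4f} -> {loss_end:.4f}")
```

In this toy the optimization acts directly on a continuous input vector. Handling discrete tokens, choosing layers, and estimating real subspaces are outside what this sketch covers.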
