Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
March 8, 2025
Authors: Thomas Winninger, Boussad Addad, Katarzyna Kapusta
cs.AI
Abstract
Traditional white-box methods for creating adversarial perturbations against
LLMs typically rely only on gradient computation from the targeted model,
ignoring the internal mechanisms responsible for attack success or failure.
Conversely, interpretability studies that analyze these internal mechanisms
lack practical applications beyond runtime interventions. We bridge this gap by
introducing a novel white-box approach that leverages mechanistic
interpretability techniques to craft practical adversarial inputs.
Specifically, we first identify acceptance subspaces, i.e., sets of feature
vectors that do not trigger the model's refusal mechanisms, and then use
gradient-based optimization to reroute embeddings from refusal subspaces to
acceptance subspaces, effectively achieving jailbreaks. This targeted approach
significantly reduces computation cost, achieving attack success rates of
80-95% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5
within minutes or even seconds, compared to existing techniques that often fail
or require hours of computation. We believe this approach opens a new direction
for both attack research and defense development. Furthermore, it showcases a
practical application of mechanistic interpretability where other methods are
less efficient, which highlights its utility. The code and generated datasets
are available at https://github.com/Sckathach/subspace-rerouting.
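
The two steps named in the abstract (characterise an acceptance region, then move a representation into it by gradient-based optimization) can be illustrated with a toy sketch on synthetic vectors. This is not the authors' implementation, which lives in the repository above. It assumes a single random linear map as a stand-in for a model layer, Gaussian clusters as stand-ins for real activations, and a difference-of-means linear probe as the description of the two regions. All names (`W`, `layer`, `probe_score`, `reroute`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Toy "layer": a fixed random linear map from input embeddings to activations.
# Assumption: a real model layer is nonlinear; this is only a stand-in.
W = rng.normal(size=(dim, dim)) / np.sqrt(dim)

def layer(e):
    return W @ e

# Synthetic embeddings for two prompt classes (hypothetical data, not model outputs).
center_refused = rng.normal(size=dim)
center_accepted = rng.normal(size=dim)
emb_refused = center_refused + 0.1 * rng.normal(size=(200, dim))
emb_accepted = center_accepted + 0.1 * rng.normal(size=(200, dim))

# Step 1: describe the two regions with a difference-of-means linear probe.
# A score above 0 means the activation lies on the "acceptance" side.
mu_r = (emb_refused @ W.T).mean(axis=0)
mu_a = (emb_accepted @ W.T).mean(axis=0)
w = mu_a - mu_r
b = -0.5 * w @ (mu_a + mu_r)

def probe_score(e):
    return float(w @ layer(e) + b)

# Step 2: gradient descent on the embedding, minimising softplus(-score),
# which pushes the resulting activation toward the acceptance side.
def reroute(e, lr=0.1, max_steps=1000, margin=1.0):
    e = e.copy()
    grad_score = W.T @ w  # d(score)/d(e) for the linear toy layer
    for _ in range(max_steps):
        s = probe_score(e)
        if s > margin:
            break
        # d softplus(-s)/ds = -sigmoid(-s) = -1 / (1 + exp(s))
        e -= lr * (-1.0 / (1.0 + np.exp(s))) * grad_score
    return e

start = emb_refused[0]
end = reroute(start)
print(f"probe score before: {probe_score(start):.2f}")
print(f"probe score after:  {probe_score(end):.2f}")
```

In this toy setting the score is linear in the embedding, so every gradient step raises it. With a real, nonlinear model the gradient would have to be obtained by backpropagation through the network, and the optimized embedding would still need to be mapped back to discrete tokens, which this sketch does not attempt.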