Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

Abstract

Traditional white-box methods for creating adversarial perturbations against large language models (LLMs) typically rely only on gradient computation from the targeted model, ignoring the internal mechanisms responsible for attack success or failure. Conversely, interpretability studies that analyze these internal mechanisms lack practical applications beyond runtime interventions. We bridge this gap by introducing a novel white-box approach that leverages mechanistic interpretability techniques to craft practical adversarial inputs. Specifically, we first identify acceptance subspaces (sets of feature vectors that do not trigger the model's refusal mechanisms), then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks. This targeted approach significantly reduces computation cost, achieving attack success rates of 80-95% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5 within minutes or even seconds, whereas existing techniques often fail or require hours of computation. We believe this approach opens a new direction for both attack research and defense development. Furthermore, it showcases a practical application of mechanistic interpretability where other methods are less efficient, which highlights its utility. The code and generated datasets are available at https://github.com/Sckathach/subspace-rerouting.
