Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

Abstract

Traditional white-box methods for creating adversarial perturbations against LLMs typically rely only on gradient computation from the targeted model, ignoring the internal mechanisms responsible for attack success or failure. Conversely, interpretability studies that analyze these internal mechanisms lack practical applications beyond runtime interventions. We bridge this gap by introducing a novel white-box approach that leverages mechanistic interpretability techniques to craft practical adversarial inputs. Specifically, we first identify acceptance subspaces (sets of feature vectors that do not trigger the model's refusal mechanisms), then use gradient-based optimization to reroute embeddings from refusal subspaces to acceptance subspaces, effectively achieving jailbreaks. This targeted approach significantly reduces computation cost, achieving attack success rates of 80-95% on state-of-the-art models including Gemma2, Llama3.2, and Qwen2.5 within minutes or even seconds, whereas existing techniques often fail or require hours of computation. We believe this approach opens a new direction for both attack research and defense development. Furthermore, it showcases a practical application of mechanistic interpretability in a setting where other methods are less efficient, which highlights its utility. The code and generated datasets are available at https://github.com/Sckathach/subspace-rerouting.
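The two-step idea in the abstract (locate a region of feature space associated with acceptance, then follow a gradient toward it) can be illustrated with a toy sketch on synthetic vectors. This is not the authors' implementation, which lives in the repository linked above. Everything here is an assumption made for illustration: the difference-of-means linear probe, the hinge-style objective, the helper names `fit_probe`, `score`, and `reroute`, and the random data. The sketch also moves a vector directly, whereas a practical attack would have to optimize the model's input.

```python
import random


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_probe(refused, accepted):
    """Hypothetical difference-of-means probe: score > 0 is the 'accepted' side."""
    mu_r, mu_a = mean(refused), mean(accepted)
    w = [a - r for a, r in zip(mu_a, mu_r)]
    midpoint = [(a + r) / 2 for a, r in zip(mu_a, mu_r)]
    b = -dot(w, midpoint)  # boundary passes through the midpoint of the means
    return w, b


def score(x, w, b):
    return dot(w, x) + b


def reroute(x, w, b, margin=1.0, lr=0.1, max_steps=1000):
    """Gradient descent on the hinge loss max(0, margin - score(x))."""
    x = list(x)
    for step in range(max_steps):
        if score(x, w, b) >= margin:
            return x, step
        # The gradient of (margin - score) with respect to x is -w,
        # so a descent step adds lr * w.
        x = [xi + lr * wi for xi, wi in zip(x, w)]
    return x, max_steps


# Synthetic stand-ins for two clusters of feature vectors.
rng = random.Random(0)
DIM = 8
refused = [[rng.gauss(-2.0, 0.5) for _ in range(DIM)] for _ in range(50)]
accepted = [[rng.gauss(2.0, 0.5) for _ in range(DIM)] for _ in range(50)]

w, b = fit_probe(refused, accepted)
start = refused[0]
moved, steps = reroute(start, w, b)
print("score before:", score(start, w, b))
print("score after: ", score(moved, w, b), "in", steps, "steps")
```

The probe turns "which subspace is this vector in" into a differentiable score, which is what makes a gradient-based search possible at all. The linear probe and the margin of 1.0 are arbitrary choices for this toy.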
