Exploring the Sources of Robustness in Sparse Autoencoders

Abstract

Large language models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit the model's internal gradient structure. Although sparse autoencoders (SAEs) are widely used for interpretability, their implications for robustness remain underexplored. We study the effect of inserting pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen), two strong white-box attacks (GCG, BEAST), and three black-box benchmarks, SAE-augmented models achieve up to a 5× reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff in which intermediate layers balance robustness and clean performance. These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry that jailbreak attacks exploit.
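
To make the inference-time splice concrete, the following is a minimal sketch of replacing a residual-stream activation with its SAE reconstruction. It is an illustration under stated assumptions, not the setup used in the study: the abstract does not specify the SAE architecture, so a TopK activation is assumed here (which makes L0 equal to at most k by construction), the weights are random stand-ins for a pretrained SAE, and the function name `topk_sae_reconstruct` and all dimensions are hypothetical.

```python
import numpy as np

def topk_sae_reconstruct(h, W_enc, b_enc, W_dec, b_dec, k):
    """Hypothetical helper: encode h, keep the k largest latents per token, decode.

    h: (tokens, d_model) residual-stream activations.
    Returns (reconstruction, sparse codes).
    """
    # ReLU pre-activations of the SAE latents (assumed encoder form).
    pre = np.maximum((h - b_dec) @ W_enc + b_enc, 0.0)
    # Indices of the k largest latents for each token.
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    # Zero everything except the selected latents, so L0 <= k per token.
    codes = np.zeros_like(pre)
    np.put_along_axis(codes, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    # Decode back into the residual-stream basis.
    recon = codes @ W_dec + b_dec
    return recon, codes

# Toy dimensions and random weights (assumptions; a real SAE would be pretrained).
rng = np.random.default_rng(0)
d_model, d_sae, n_tokens, k = 16, 64, 4, 8
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

h = rng.normal(size=(n_tokens, d_model))  # stand-in for a layer's residual stream
h_spliced, codes = topk_sae_reconstruct(h, W_enc, b_enc, W_dec, b_dec, k)

# The spliced activation has the same shape and would be passed to the next layer.
print("L0 per token:", np.count_nonzero(codes, axis=-1))
```

In this sketch the model's own weights are untouched; only the activation passed to later layers changes. NumPy has no gradients, but the same operations written in an autodiff framework remain differentiable through the selected latents, which is one way to read "without blocking gradients". Lowering `k` tightens the bottleneck, which corresponds to the L0 sparsity knob varied in the ablations.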