SAE Interventions Are Unreliable: Recovering Suppressed Behavior After Intervention

Abstract

Sparse autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that SAE features identified as "unsafe" serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this apparent success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Recovery remains possible even under a strong threat model in which the intervention stays active throughout optimization and generation. To rule out the possibility that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal-steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. In particular, in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping the relative drift of the defended features to 0.131, substantially below that of suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
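
The encoder-orthogonal update mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear SAE encoder pre-activation of the form W_enc x + b_enc, and the names `project_orthogonal`, `W_enc`, `b_enc`, `x`, and `g` are hypothetical stand-ins, with random data in place of a real model. The idea shown is only that a residual perturbation projected onto the orthogonal complement of the defended features' encoder directions leaves those features' pre-activations unchanged.

```python
import numpy as np

def project_orthogonal(update, enc_rows):
    """Remove from `update` any component in the span of `enc_rows`.

    update:   (d,) candidate residual perturbation
    enc_rows: (k, d) encoder directions of the defended features
              (assumed linearly independent)
    """
    # Orthonormal basis of the span of the encoder rows, via reduced QR.
    q, _ = np.linalg.qr(enc_rows.T)  # q has shape (d, k)
    return update - q @ (q.T @ update)

rng = np.random.default_rng(0)
d, k = 16, 3
W_enc = rng.normal(size=(k, d))  # hypothetical encoder rows for defended features
b_enc = rng.normal(size=k)       # hypothetical encoder bias
x = rng.normal(size=d)           # stand-in for the post-intervention residual state
g = rng.normal(size=d)           # stand-in for a behavior-recovery gradient

# Keep only the part of the update that the defended features cannot see.
delta = project_orthogonal(g, W_enc)

pre_before = W_enc @ x + b_enc
pre_after = W_enc @ (x + delta) + b_enc
drift = np.linalg.norm(pre_after - pre_before) / np.linalg.norm(pre_before)
print(f"relative drift of defended pre-activations: {drift:.2e}")
```

In this idealized single-layer, linear case the drift is zero up to floating-point error. The cross-layer setting described in the abstract would instead need the Jacobian of the feature map with respect to the perturbed residual, which this sketch does not cover.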