SAE Interventions Are Unreliable: Recovery of Suppressed Behaviors After Intervention

Abstract

Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this apparent success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out the possibility that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal-steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. In particular, in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift to 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
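
The encoder-orthogonal update for the single-layer case can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the defended features are read out by rows of an SAE encoder matrix (here `W_def`) and uses a toy gradient function as a stand-in for the behavior-recovery objective. All names (`orthogonal_projector`, `recover_orthogonal`, `grad_fn`) are hypothetical. The idea shown is that projecting each gradient step onto the orthogonal complement of the defended encoder directions leaves those features' pre-activations unchanged, so any recovered behavior cannot come from undoing the intervention along those directions.

```python
import numpy as np


def orthogonal_projector(W_def):
    """Projector onto the orthogonal complement of the rows of W_def.

    W_def: (k, d) encoder rows of the defended SAE features (assumed layout).
    """
    # Orthonormal basis whose span contains the defended encoder directions.
    Q, _ = np.linalg.qr(W_def.T)  # Q has shape (d, k)
    return np.eye(W_def.shape[1]) - Q @ Q.T


def recover_orthogonal(h_post, W_def, grad_fn, steps=200, lr=0.5):
    """Projected gradient descent on a residual perturbation delta.

    h_post:  (d,) residual state after the intervention.
    grad_fn: maps a residual vector to the gradient of a behavior loss
             (a toy stand-in for the recovery objective).
    """
    P = orthogonal_projector(W_def)
    delta = np.zeros_like(h_post)
    for _ in range(steps):
        g = grad_fn(h_post + delta)
        # Step only in directions the defended encoder rows cannot see,
        # so W_def @ delta stays at zero up to floating-point error.
        delta -= lr * (P @ g)
    return delta
```

Because every step lies in the null space of `W_def`, the defended pre-activations `W_def @ (h_post + delta) + b` equal their post-intervention values for any encoder bias `b`. The cross-layer setting mentioned in the abstract would need the feature-map Jacobian in place of `W_def`, which this sketch does not cover.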