SAE Interventions Are Unreliable: Post-Intervention Recovery of Suppressed Behaviors

Abstract

Sparse autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this apparent success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formalize this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model in which the intervention remains active throughout optimization and generation, recovery remains possible. To rule out the possibility that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal-steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. In the safety-critical refusal-steering setting in particular, we achieve a 95.8% recovery rate on valid samples while keeping the relative drift of the defended features to 0.131, a drift substantially below that of suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
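
To make the encoder-orthogonal idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes an SAE encoder of the form f = ReLU(W_enc h + b), so a residual update that is orthogonal to the encoder rows of the defended features leaves their pre-activations, and hence their activations, unchanged. The behavior objective is replaced by a toy linear score, and all names (`encoder_orthogonal_projector`, `recover`, `target_dir`) are hypothetical.

```python
import numpy as np


def encoder_orthogonal_projector(w_enc_rows):
    """Projector onto the orthogonal complement of the defended encoder rows.

    w_enc_rows: (k, d) array holding the encoder directions of the k
    defended features (assumed layout, one row per feature).
    """
    d = w_enc_rows.shape[1]
    # Orthonormal basis (d, k) for the span of the defended encoder rows.
    q, _ = np.linalg.qr(w_enc_rows.T)
    return np.eye(d) - q @ q.T


def recover(h_post, target_dir, w_enc_rows, steps=50, lr=0.1):
    """Ascend a toy linear behavior score using only encoder-orthogonal steps.

    The score <target_dir, h> stands in for a real behavioral objective
    (an assumption for illustration); its gradient is simply target_dir.
    """
    proj = encoder_orthogonal_projector(w_enc_rows)
    h = h_post.copy()
    for _ in range(steps):
        grad = target_dir
        # Remove the component that would change defended pre-activations.
        h = h + lr * (proj @ grad)
    return h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 16, 2
    w_enc_rows = rng.normal(size=(k, d))
    h_post = rng.normal(size=d)
    target_dir = rng.normal(size=d)

    h_rec = recover(h_post, target_dir, w_enc_rows)
    drift = np.abs(w_enc_rows @ h_rec - w_enc_rows @ h_post).max()
    print("max change in defended pre-activations:", drift)
    print("score before:", target_dir @ h_post, "after:", target_dir @ h_rec)
```

In this toy setting the defended pre-activations stay fixed up to floating-point error while the score still increases, which mirrors the abstract's point that holding a feature at its post-intervention value does not by itself constrain the rest of the residual space. Encoders whose activations depend on other features (for example, top-k selection) would need a different constraint than this linear projection.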