SAE Interventions Are Not Reliable: Recovering Suppressed Behavior After Intervention

Abstract

Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that a successful clamp can hide a recoverable failure mode: it may block one visible route to a behavior without eliminating the behavior itself. We formalize this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Recovery remains possible even under a strong threat model in which the intervention stays active throughout optimization and generation. To rule out the possibility that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal-steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. In the safety-critical refusal-steering setting in particular, we achieve a 95.8% recovery rate on valid samples while keeping the relative drift of the defended features at 0.131, substantially below that of suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
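
To make the idea of an encoder-orthogonal update concrete, the following is a minimal sketch and not the paper's implementation. It assumes a ReLU SAE encoder of the form f(x) = ReLU(W x + b), uses a random vector as a stand-in for a real behavior-recovery gradient, and defines relative drift as a norm ratio. All names (`encoder_orthogonal_projector`, `sae_features`, `relative_drift`) and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def encoder_orthogonal_projector(enc_rows: np.ndarray) -> np.ndarray:
    """Projector onto the orthogonal complement of the defended encoder rows.

    enc_rows has shape (k, d): one encoder direction per defended feature.
    """
    # Reduced QR gives an orthonormal basis (columns of q) covering the row span.
    q, _ = np.linalg.qr(enc_rows.T)
    d = enc_rows.shape[1]
    return np.eye(d) - q @ q.T


def sae_features(x: np.ndarray, w_enc: np.ndarray, b_enc: np.ndarray) -> np.ndarray:
    """Assumed ReLU encoder, restricted to the defended features."""
    return np.maximum(w_enc @ x + b_enc, 0.0)


def relative_drift(f_ref: np.ndarray, f_new: np.ndarray, eps: float = 1e-8) -> float:
    """Assumed drift metric: change in feature values relative to the reference norm."""
    return float(np.linalg.norm(f_new - f_ref) / (np.linalg.norm(f_ref) + eps))


rng = np.random.default_rng(0)
d_model, n_defended = 16, 3

w_enc = rng.normal(size=(n_defended, d_model))  # hypothetical defended encoder rows
b_enc = rng.normal(size=n_defended)             # hypothetical encoder biases
x_post = rng.normal(size=d_model)               # stand-in post-intervention residual
raw_step = rng.normal(size=d_model)             # stand-in for a recovery gradient step

# Remove every component of the step that the defended encoder rows can see.
proj = encoder_orthogonal_projector(w_enc)
safe_step = proj @ raw_step

f_ref = sae_features(x_post, w_enc, b_enc)
drift_raw = relative_drift(f_ref, sae_features(x_post + raw_step, w_enc, b_enc))
drift_safe = relative_drift(f_ref, sae_features(x_post + safe_step, w_enc, b_enc))

print(f"relative drift, unprojected step: {drift_raw:.3e}")
print(f"relative drift, projected step:   {drift_safe:.3e}")
```

Because the projected step satisfies W·δ = 0 for the defended rows, the pre-activations of those features are unchanged up to floating-point error, so any behavioral change produced by such a step cannot be attributed to the clamp being undone at that layer. The cross-layer setting described in the abstract would instead need the Jacobian of the downstream feature map, which this sketch does not attempt.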