Steered LLM Activations are Non-Surjective

Abstract

Activation steering is a popular white-box control technique that modifies a model's activations to elicit an abstract change in the model's behavior. It has also become a standard tool in interpretability research (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., jailbreakability). However, it is unclear whether steered behavior is realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
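
The surjectivity question can be made concrete with a toy sketch. This is not the experimental setup of the paper: the "model" below is a hypothetical random embedding table followed by one random linear map and a tanh, and every name (forward, steer, E, W, VOCAB, DIM, MAX_LEN, alpha) is an assumption chosen for illustration. The sketch enumerates every activation reachable from short prompts, applies additive steering to one of them, and measures how far the steered vector lies from the reachable set.

```python
import itertools
import math
import random

random.seed(0)
VOCAB, DIM, MAX_LEN = 5, 8, 3  # hypothetical toy sizes

# Hypothetical toy "model": random token embeddings and one random linear map.
E = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
W = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]

def forward(prompt):
    """Map a token-id sequence to a toy activation: tanh(W @ mean embedding)."""
    mean = [sum(E[t][i] for t in prompt) / len(prompt) for i in range(DIM)]
    return [math.tanh(sum(W[r][c] * mean[c] for c in range(DIM)))
            for r in range(DIM)]

def steer(h, v, alpha):
    """Additive steering: shift activation h by alpha along direction v."""
    return [hi + alpha * vi for hi, vi in zip(h, v)]

# Every prompt of length 1..MAX_LEN, and the finite set of activations they reach.
prompts = [p for n in range(1, MAX_LEN + 1)
           for p in itertools.product(range(VOCAB), repeat=n)]
reachable = [forward(p) for p in prompts]

# Random unit-norm steering direction (assumed; real steering vectors are learned
# or derived from contrastive prompts).
raw = [random.gauss(0, 1) for _ in range(DIM)]
norm = math.sqrt(sum(x * x for x in raw))
v = [x / norm for x in raw]

# Steer the activation of one prompt and look for a prompt that reproduces it.
steered = steer(forward((0, 1)), v, alpha=10.0)
min_dist = min(math.dist(steered, h) for h in reachable)

print("prompts enumerated:", len(prompts))
print("steered activation has an exact preimage:", min_dist == 0.0)
```

In this toy the outcome is forced by construction: tanh bounds every reachable activation, and the steering step is large, so the steered vector cannot coincide with any enumerated one. The claim in the abstract is stronger and does not rely on such a bound; it concerns the set of states reachable from all discrete prompts in a real LLM's residual stream.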