What the Middle Layers Perceive: Jailbreak Detection via Entropy Dynamics

Abstract

Jailbreak attacks reveal a persistent weakness in aligned large language models (LLMs): carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it remains unclear how harmful intent is encoded within the model's internal representations. We investigate this question by analyzing token-level predictive entropy trajectories across layers of a frozen LLM using the logit lens. We find that static aggregate statistics of prompt-level entropy (e.g., mean, variance) carry little discriminative signal, whereas features capturing how entropy evolves across token positions, such as monotonic rank-based trend scores, are substantially more informative. Importantly, this signal is not uniform across model depth: it is concentrated in intermediate layers and degrades at the final layer, indicating that jailbreak-relevant structure is most pronounced in mid-network representations rather than at the output head. Across multiple models (Llama, Qwen, Gemma) and adversarial benchmarks, these entropy dynamics provide architecture-consistent separation without additional training. Together, our findings show that jailbreak behavior is reflected in structured intermediate uncertainty dynamics, clarifying both which entropy-derived features encode harmful intent and where in the network that signal is most pronounced.
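
To make the two ingredients named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified logit lens in which one layer's hidden states are projected directly through the unembedding matrix (the final normalization layer that a real logit lens typically applies is omitted), and it uses Spearman's rank correlation between token position and entropy as one possible "monotonic rank-based trend score". The function names `token_entropies` and `rank_trend_score` are hypothetical.

```python
import numpy as np


def token_entropies(hidden, unembed):
    """Per-token predictive entropy (in nats) under a simplified logit lens.

    hidden:  (T, d) hidden states of ONE layer for a T-token prompt.
    unembed: (d, V) unembedding matrix (assumed; final norm omitted).
    """
    logits = hidden @ unembed                              # (T, V)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_z = np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    log_p = logits - log_z                                 # log-softmax
    p = np.exp(log_p)
    return -(p * log_p).sum(axis=-1)                       # (T,)


def rank_trend_score(values):
    """Spearman rank correlation between token position and value.

    Returns a score in [-1, 1]: +1 for strictly increasing, -1 for
    strictly decreasing. Assumption: ties are broken by position
    rather than averaged, which is adequate for continuous entropies.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    if n < 2:
        return 0.0
    pos = np.arange(n, dtype=float)
    # Double argsort converts values to ranks 0..n-1.
    order = np.argsort(values, kind="stable")
    ranks = np.argsort(order, kind="stable").astype(float)
    pc = pos - pos.mean()
    rc = ranks - ranks.mean()
    denom = np.sqrt((pc ** 2).sum() * (rc ** 2).sum())
    return float((pc * rc).sum() / denom)


def layer_trend_scores(hidden_by_layer, unembed):
    """One trend score per layer, given a list of (T, d) arrays."""
    return [rank_trend_score(token_entropies(h, unembed))
            for h in hidden_by_layer]
```

The sketch also illustrates why static aggregates can miss what a trend feature captures: the entropy sequences `[1, 2, 3, 4]` and `[4, 3, 2, 1]` have identical mean and variance, yet `rank_trend_score` assigns them opposite signs. Which layers carry the most separable scores is an empirical question that this sketch does not answer.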