What the Middle Layers Know: Jailbreak Detection via Entropy Dynamics

Abstract

Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it remains unclear how harmful intent is encoded within the model's internal representations. We investigate this question by analyzing token-level predictive entropy trajectories across layers of a frozen LLM using the logit lens. We find that static aggregate statistics of prompt-level entropy (e.g., mean, variance) carry little discriminative signal, whereas features capturing how entropy evolves across token positions, such as monotonic rank-based trend scores, are substantially more informative. Importantly, this signal is not uniform across model depth: it is concentrated in intermediate layers and degrades at the final layer, indicating that jailbreak-relevant structure is most pronounced in mid-network representations rather than at the output head. Across multiple models (Llama, Qwen, Gemma) and adversarial benchmarks, these entropy dynamics provide architecture-consistent separation without additional training. Together, our findings show that jailbreak behavior is reflected in structured intermediate uncertainty dynamics, clarifying both which entropy-derived features encode harmful intent and where in the network that signal is most pronounced.
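As a rough illustration of the kind of feature described above, the following minimal sketch computes per-token predictive entropy from a matrix of logits and then summarizes how that entropy evolves across token positions with a rank-based monotonic trend score. The abstract does not name the exact statistic, so Kendall's tau against token position is used here purely as an assumption. The function names (`token_entropy`, `rank_trend_score`, `layer_trend_profile`) and the `layer_logits` array are hypothetical; in a logit-lens setting, `layer_logits` would be obtained by projecting each layer's hidden states through the model's unembedding, a step this sketch does not implement.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at each token position.

    logits: array of shape (num_tokens, vocab_size).
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum for a numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)


def rank_trend_score(values):
    """Kendall's tau-a between a sequence and its positions, in [-1, 1].

    +1 means strictly increasing, -1 strictly decreasing, 0 no trend.
    Assumption: this stands in for the paper's unspecified trend score.
    """
    values = np.asarray(values, dtype=np.float64)
    n = len(values)
    if n < 2:
        return 0.0
    diffs = values[None, :] - values[:, None]  # diffs[i, j] = values[j] - values[i]
    upper = np.triu_indices(n, k=1)            # all pairs with i < j
    return float(np.sign(diffs[upper]).sum() / (n * (n - 1) / 2))


def layer_trend_profile(layer_logits):
    """Trend score of the token-entropy trajectory at each layer.

    layer_logits: hypothetical array of shape (num_layers, num_tokens, vocab_size).
    """
    return np.array([rank_trend_score(token_entropy(l)) for l in layer_logits])


if __name__ == "__main__":
    # Random placeholder logits; real inputs would come from a frozen LLM.
    rng = np.random.default_rng(0)
    fake_logits = rng.normal(size=(4, 12, 50))
    print(layer_trend_profile(fake_logits))
```

Comparing such a per-layer profile between benign and adversarial prompts is one way to examine whether separation is stronger at intermediate depth than at the final layer. The quadratic pairwise comparison is adequate for prompt-length sequences; for very long inputs, a library routine such as `scipy.stats.kendalltau` would be a reasonable substitute.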