What the Intermediate Layers Know: Jailbreak Detection from Entropy Dynamics

Abstract

Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it remains unclear how harmful intent is encoded within the model's internal representations. We investigate this question by analyzing token-level predictive entropy trajectories across layers of a frozen LLM using the logit lens. We find that static aggregate statistics of prompt-level entropy (e.g., mean, variance) carry little discriminative signal, whereas features capturing how entropy evolves across token positions, such as monotonic rank-based trend scores, are substantially more informative. Importantly, this signal is not uniform across model depth: it is concentrated in intermediate layers and degrades at the final layer, indicating that jailbreak-relevant structure is most pronounced in mid-network representations rather than at the output head. Across multiple models (Llama, Qwen, Gemma) and adversarial benchmarks, these entropy dynamics provide architecture-consistent separation without additional training. Together, our findings show that jailbreak behavior is reflected in structured intermediate uncertainty dynamics, clarifying both which entropy-derived features encode harmful intent and where in the network that signal is most pronounced.
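
To make the pipeline described in the abstract concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It applies a simplified logit lens (each layer's hidden states are projected through the unembedding matrix, omitting the final normalization that a real model would usually apply first), computes the predictive entropy at each token position, and summarizes each layer with a rank-based trend score. The abstract does not specify the exact trend statistic; a Spearman-style correlation between token position and entropy is assumed here, and the names `token_entropy`, `trend_score`, `layerwise_trend`, `hidden`, and `W_U` are hypothetical.

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of softmax(logits) along the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def _ordinal_ranks(x: np.ndarray) -> np.ndarray:
    """Ranks 0..n-1; ties are broken by position (an assumption of this sketch)."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks

def trend_score(entropy: np.ndarray) -> float:
    """Assumed trend feature: Spearman-style correlation between token
    position and entropy rank, in [-1, 1]. Positive means entropy tends
    to rise along the prompt, negative means it tends to fall."""
    n = len(entropy)
    if n < 2:
        return 0.0  # a trend is undefined for fewer than two tokens
    positions = np.arange(n, dtype=float)
    return float(np.corrcoef(positions, _ordinal_ranks(entropy))[0, 1])

def layerwise_trend(hidden: np.ndarray, W_U: np.ndarray) -> np.ndarray:
    """hidden: [layers, tokens, d_model]; W_U: [d_model, vocab].
    Returns one trend score per layer."""
    logits = hidden @ W_U             # simplified logit lens: [layers, tokens, vocab]
    entropy = token_entropy(logits)   # [layers, tokens]
    return np.array([trend_score(e) for e in entropy])

# Synthetic stand-ins for a frozen model's hidden states and unembedding matrix.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 12, 16))
W_U = rng.normal(size=(16, 50))
scores = layerwise_trend(hidden, W_U)  # one score per layer
```

In this sketch the per-layer scores would be compared between benign and jailbreak prompts, for example by thresholding the score at a chosen intermediate layer. Which layer and threshold to use are not given in the abstract and would have to be chosen empirically.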