When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Abstract

The safety of large language models (LLMs) is usually evaluated at the behavioral level. Such evaluations provide only limited evidence of internal robustness, because they target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework that tests model robustness through soft interventions in parameter space and latent space, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS), which measures how easily harmful behavior can be elicited by bounded latent perturbations. Using this framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple state-of-the-art models, both safety-aligned and non-safety-aligned. Notably, under harmful intervention, dissociated models show substantially elevated LVS despite comparable refusal behavior, with intermediate-layer representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone gives an incomplete picture of model robustness, motivating representation-aware audits that consider both latent vulnerability and observable behavior.