When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Abstract

The safety of large language models (LLMs) is usually evaluated only at the behavioral level. Such evaluations target outputs rather than representation-level vulnerability under intervention, so they provide limited evidence of internal robustness. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework that tests model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS), which measures how easily harmful behavior can be elicited by bounded latent perturbations. Using this framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVS despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits that consider both latent vulnerability and observable behavior.
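The abstract does not state the LVS formula, so the following is only a toy sketch of one possible reading: for each layer, the fraction of random perturbations of fixed norm `eps` that, when added to that layer's input, turn a refused output into a harmful one. Everything here is an assumption made for illustration, including the names `latent_vulnerability`, `forward`, `scale`, and `is_harmful`, the scalar-gain "layers", and the sign-based harm check; none of it is taken from the paper.

```python
import math
import random


def random_direction(dim, rng):
    """Sample a unit vector uniformly at random (normalized Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def forward(layers, x, perturb_at=None, delta=None):
    """Run the toy layer stack, adding `delta` to the input of layer `perturb_at`."""
    h = list(x)
    for i, layer in enumerate(layers):
        if i == perturb_at:
            h = [a + b for a, b in zip(h, delta)]
        h = layer(h)
    return h


def latent_vulnerability(layers, prompts, is_harmful, eps, n_samples=1000, seed=0):
    """Hypothetical per-layer score (an assumption, not the paper's definition):
    fraction of (prompt, perturbation) pairs with ||delta|| = eps that yield
    a harmful output."""
    rng = random.Random(seed)
    scores = []
    for layer_idx in range(len(layers)):
        hits = 0
        for x in prompts:
            for _ in range(n_samples):
                delta = [eps * d for d in random_direction(len(x), rng)]
                if is_harmful(forward(layers, x, layer_idx, delta)):
                    hits += 1
        scores.append(hits / (len(prompts) * n_samples))
    return scores


def scale(gain):
    """Toy 'layer': multiply every coordinate by a positive gain."""
    return lambda h: [gain * v for v in h]


def is_harmful(output):
    """Toy harm check: a positive first coordinate stands for a harmful completion."""
    return output[0] > 0


prompts = [[-1.0, 0.0]]  # both toy models refuse this prompt when unperturbed

robust_layers = [scale(1.0), scale(1.0), scale(1.0)]
# Same end-to-end behavior, but the hidden state entering layer 1 is small,
# so a perturbation of the same norm can flip its sign there.
dissociated_layers = [scale(0.1), scale(10.0), scale(1.0)]

robust_scores = latent_vulnerability(robust_layers, prompts, is_harmful, eps=0.5)
dissociated_scores = latent_vulnerability(dissociated_layers, prompts, is_harmful, eps=0.5)
print(robust_scores)
print(dissociated_scores)
```

Both toy models refuse the unperturbed prompt, so a purely behavioral check cannot tell them apart. Under this assumed score, the robust stack scores zero at every layer, while the dissociated stack is vulnerable only at its middle layer, where the score should come out near arccos(0.2)/π ≈ 0.44. The example is constructed to mirror the abstract's qualitative claim and is not evidence for it.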