When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Abstract

The safety of large language models (LLMs) has often been evaluated at the behavioral level. Such evaluations target outputs rather than representation-level vulnerability under intervention, so they provide limited evidence of internal robustness. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework that tests model robustness through soft interventions in parameter space and latent space, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS), which measures how easily harmful behavior can be elicited by bounded latent perturbations. Using this framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple state-of-the-art models, both safety-aligned and non-safety-aligned. Notably, dissociated models show substantially elevated LVS values despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits that consider both latent vulnerability and observable behavior.
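
To make the idea of an audit gap concrete, the following is a minimal toy sketch and not the definition used in this work. The abstract does not state the LVS formula, so everything here is an assumption made for illustration: the linear "harm readout" (`w`, `b`), the L2 perturbation budget `eps`, the two-dimensional hidden states, and the choice to score vulnerability as the fraction of refused prompts that flip to harmful within the budget.

```python
import math

def harm_logit(h, w, b):
    """Toy linear readout (assumed): a positive value means the model complies."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def refusal_rate(states, w, b):
    """Behavioral metric: fraction of prompts refused with no intervention."""
    return sum(harm_logit(h, w, b) < 0 for h in states) / len(states)

def worst_case_perturbation(w, eps):
    """For a linear readout, the L2-bounded shift that raises the logit
    the most is eps * w / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [eps * wi / norm for wi in w]

def latent_vulnerability_score(states, w, b, eps):
    """Hypothetical LVS stand-in: fraction of initially refused prompts that
    flip to harmful under the worst-case perturbation of norm eps."""
    delta = worst_case_perturbation(w, eps)
    refused = [h for h in states if harm_logit(h, w, b) < 0]
    if not refused:
        return 0.0
    flipped = sum(
        harm_logit([hi + di for hi, di in zip(h, delta)], w, b) >= 0
        for h in refused
    )
    return flipped / len(refused)

# Made-up hidden states for three prompts under two toy models.
w, b = [1.0, 0.0], 0.0
robust_states = [[-3.0, 0.5], [-4.0, -1.0], [-2.5, 2.0]]
dissociated_states = [[-0.2, 0.5], [-0.4, -1.0], [-3.0, 2.0]]

for name, states in [("robust", robust_states), ("dissociated", dissociated_states)]:
    print(name,
          refusal_rate(states, w, b),
          latent_vulnerability_score(states, w, b, eps=0.5))
```

In this toy setting both models refuse every prompt, so a purely behavioral metric cannot tell them apart, while the perturbation-based score separates them because two of the dissociated model's states sit within `eps` of the decision boundary. A real evaluation would perturb the activations of an actual model layer by layer instead of using a fixed linear readout.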