When Behavioral Safety Evaluations Fail: A Representation-Level Perspective

Abstract

The safety of large language models (LLMs) has often been evaluated at the behavioral level. Because such evaluations target outputs rather than representation-level vulnerability under intervention, they provide only limited evidence of internal robustness. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in latent space. We introduce an intervention-based evaluation framework that tests model robustness through soft interventions in parameter space and latent space, including harmful fine-tuning and layer-wise latent perturbations. To quantify the evaluation, we propose the Latent Vulnerability Score (LVS), which measures how easily harmful behavior can be elicited by bounded latent perturbations. Using this framework, we show across multiple safely and unsafely aligned state-of-the-art models that behavioral safety metrics are not sufficient measures of representation-level robustness. Notably, dissociated models show substantially elevated LVS despite comparable refusal behavior under harmful intervention, and intermediate representations are the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone gives an incomplete picture of model robustness, motivating representation-aware audits that consider both latent vulnerability and observable behavior.
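To make the audit gap concrete, the following is a minimal toy sketch, not the definition of LVS used in this work. It assumes a hypothetical linear "harm readout" over a two-dimensional latent state and an L2 perturbation budget `eps`. The names `harm_logit`, `refusal_rate`, and `toy_lvs`, together with all numeric values, are illustrative assumptions. The toy score is the fraction of latent states for which some perturbation within the budget flips a refusal into compliance.

```python
import math

# ASSUMPTION: a linear readout stands in for the model's decision to comply.
# A positive logit means "complies with the harmful request"; otherwise "refuses".
def harm_logit(h, w, b):
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def refusal_rate(states, w, b):
    """Behavioral metric: fraction of unperturbed latent states that refuse."""
    return sum(harm_logit(h, w, b) <= 0 for h in states) / len(states)

def toy_lvs(states, w, b, eps):
    """Toy vulnerability score (hypothetical, not the paper's LVS).

    Fraction of states for which some perturbation d with ||d||_2 <= eps
    gives a positive harm logit. For a linear readout the worst-case
    perturbation is eps * w / ||w||, which raises the logit by eps * ||w||.
    """
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    flipped = sum(harm_logit(h, w, b) + eps * w_norm > 0 for h in states)
    return flipped / len(states)

# Illustrative readout and latent states (all values are made up).
w, b = [3.0, 4.0], 0.0
robust_states = [[-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0], [-1.0, -2.0]]
dissociated_states = [[-0.1, 0.0], [0.0, -0.1], [-0.1, -0.1], [-1.0, -1.0]]
eps = 0.5

robust_refusal = refusal_rate(robust_states, w, b)
dissociated_refusal = refusal_rate(dissociated_states, w, b)
robust_score = toy_lvs(robust_states, w, b, eps)
dissociated_score = toy_lvs(dissociated_states, w, b, eps)

print("refusal rate:", robust_refusal, dissociated_refusal)
print("toy LVS:", robust_score, dissociated_score)
```

In this toy setting both sets of states refuse on every unperturbed input, so a purely behavioral metric cannot tell them apart. The toy score is 0.0 for the first set and 0.75 for the second, because three of its four states sit close to the decision boundary. A real evaluation would perturb hidden activations of an actual model layer by layer rather than use a closed-form linear margin.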