Intent Laundering: AI Safety Datasets Are Not What They Seem

Abstract

We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks based on three key properties: being driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world adversarial behavior due to their overreliance on triggering cues. Once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated by existing datasets and how real-world adversaries behave.
