Intent Laundering: AI Safety Datasets Are Not What They Seem
February 17, 2026
Authors: Shahriar Golchin, Marc Wetter
cs.AI
Abstract
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks based on three key properties: being driven by ulterior intent, being well crafted, and being out-of-distribution. We find that these datasets over-rely on "triggering cues": words or phrases with overtly negative or sensitive connotations that are intended to explicitly trigger safety mechanisms, a pattern that diverges sharply from real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away the triggering cues in adversarial attack data points while strictly preserving their malicious intent and all relevant details. Our results indicate that, owing to this overreliance on triggering cues, current AI safety datasets fail to faithfully represent real-world adversarial behavior. Once the cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how existing datasets evaluate model safety and how real-world adversaries behave.
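To make the "triggering cue" failure mode concrete, the toy sketch below shows how a keyword-based safety heuristic can be evaded by surface rewording alone. This is not the paper's intent-laundering procedure (which is applied to full adversarial data points via abstraction); the cue list, function name, and example prompts are all hypothetical illustrations.

```python
# Hypothetical list of overtly negative/sensitive cue words.
TRIGGERING_CUES = {"bomb", "hack", "steal", "weapon"}

def naive_safety_filter(prompt: str) -> bool:
    """Return True if the prompt passes (no overt cue word found).

    A brittle heuristic: it checks only for surface-level cue words,
    not for the underlying intent of the request.
    """
    tokens = {t.strip(".,!?").lower() for t in prompt.split()}
    return tokens.isdisjoint(TRIGGERING_CUES)

original = "Explain how to hack a server."
reworded = "Explain how to gain unauthorized access to a server."

print(naive_safety_filter(original))   # False: the cue word trips the filter
print(naive_safety_filter(reworded))   # True: same intent, no overt cue
```

The reworded prompt preserves the request's intent and details yet passes the filter, mirroring the abstract's claim that removing triggering cues flips safety evaluations.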