Intent Laundering: AI Safety Datasets Are Not What They Seem

Abstract

We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks based on three key properties: being driven by ulterior intent, being well-crafted, and being out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative or sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world adversarial behavior because of their overreliance on triggering cues. Once these cues are removed, all models previously evaluated as "reasonably safe" become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how existing datasets evaluate model safety and how real-world adversaries behave.
