Trust The Typical

Abstract

Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
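
The abstract does not say how T3 models the distribution of acceptable prompts or how it measures deviation. As a rough illustration of the general idea only, the sketch below fits a single Gaussian to embeddings of safe prompts and flags inputs whose squared Mahalanobis distance exceeds a threshold calibrated on safe data alone. This is a common OOD baseline swapped in for illustration, not T3's stated method. The class name `TypicalityDetector`, its parameters, and the synthetic random vectors standing in for real sentence embeddings are all assumptions.

```python
import numpy as np


class TypicalityDetector:
    """Hypothetical OOD scorer: a Gaussian fit to safe-prompt embeddings.

    Illustrative baseline only; T3's actual density model is not described
    in the abstract.
    """

    def __init__(self, fpr_target: float = 0.01, ridge: float = 1e-3):
        self.fpr_target = fpr_target  # share of safe data allowed above threshold
        self.ridge = ridge            # diagonal regularizer for the covariance

    def fit(self, safe_embeddings: np.ndarray) -> "TypicalityDetector":
        # Only safe examples are used; no harmful data is needed to calibrate.
        self.mean_ = safe_embeddings.mean(axis=0)
        cov = np.cov(safe_embeddings, rowvar=False)
        cov += self.ridge * np.eye(cov.shape[0])
        self.precision_ = np.linalg.inv(cov)
        # Threshold = (1 - fpr_target) quantile of scores on the safe set.
        scores = self.score(safe_embeddings)
        self.threshold_ = float(np.quantile(scores, 1.0 - self.fpr_target))
        return self

    def score(self, embeddings: np.ndarray) -> np.ndarray:
        # Squared Mahalanobis distance of each row from the safe mean.
        centered = np.atleast_2d(embeddings) - self.mean_
        return np.einsum("ij,jk,ik->i", centered, self.precision_, centered)

    def flag(self, embeddings: np.ndarray) -> np.ndarray:
        # True where an input deviates more than the calibrated threshold.
        return self.score(embeddings) > self.threshold_


# Synthetic stand-ins for embeddings (assumption: 16-dimensional vectors).
rng = np.random.default_rng(0)
safe_train = rng.normal(0.0, 1.0, size=(2000, 16))
held_out = rng.normal(0.0, 1.0, size=(500, 16))   # unseen but typical
shifted = rng.normal(4.0, 1.0, size=(200, 16))    # atypical inputs

detector = TypicalityDetector(fpr_target=0.01).fit(safe_train)
print("flag rate on held-out typical inputs:", detector.flag(held_out).mean())
print("flag rate on shifted inputs:", detector.flag(shifted).mean())
```

In a real setting the random vectors would be replaced by embeddings from a text encoder, and a single Gaussian would likely be too crude for a multimodal distribution of safe prompts. The point of the sketch is only that the threshold comes entirely from safe data, which is what lets this style of detector avoid training on harmful examples.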