Trust The Typical

February 4, 2026
Authors: Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary
cs.AI

Abstract

Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
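The abstract's core idea is to learn the distribution of safe prompts in a semantic space and flag large deviations as out-of-distribution. The sketch below illustrates that idea only; it is not the paper's implementation. The synthetic vectors stand in for real prompt embeddings, and the Gaussian fit, Mahalanobis-distance score, and quantile threshold are illustrative assumptions standing in for whatever density model T3 actually uses.

```python
"""Minimal sketch: safety as OOD detection over a semantic space.

Assumptions (not from the paper): a Gaussian fit, a Mahalanobis-distance
score, and a quantile threshold stand in for T3's actual detector, and
synthetic vectors stand in for real prompt embeddings from an encoder.
"""
import numpy as np

rng = np.random.default_rng(0)
dim = 32

# In practice these would be encoder embeddings of known-safe prompts;
# here we simulate a tight "typical" cluster for a runnable demo.
safe_embeddings = rng.normal(loc=0.0, scale=1.0, size=(2000, dim))

# Fit the typical set: mean and covariance of the safe distribution.
mean = safe_embeddings.mean(axis=0)
cov = np.cov(safe_embeddings, rowvar=False)
prec = np.linalg.inv(cov + 1e-6 * np.eye(dim))  # regularized inverse covariance

def mahalanobis_sq(x):
    """Squared Mahalanobis distance of embedding(s) x from the safe mean."""
    d = x - mean
    return np.einsum("...j,jk,...k->...", d, prec, d)

# Calibrate the threshold on safe data only (here the 99th percentile),
# so no harmful examples are ever needed for training.
threshold = np.quantile(mahalanobis_sq(safe_embeddings), 0.99)

def flag_as_threat(embedding):
    """True if the prompt embedding falls outside the typical region."""
    return mahalanobis_sq(embedding) > threshold

# A prompt near the safe cluster passes; one far from it is flagged.
in_dist = rng.normal(size=dim)
out_dist = rng.normal(loc=4.0, size=dim)
print(flag_as_threat(in_dist), flag_as_threat(out_dist))  # expect: False True
```

Because the threshold is calibrated only on safe data, this toy detector mirrors the abstract's claim that no training on harmful examples is required; anything sufficiently atypical relative to the safe distribution is flagged regardless of whether it was ever enumerated as a threat.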