Trust The Typical

February 4, 2026
Authors: Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary
cs.AI

Abstract

Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
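The abstract's core idea is to learn the distribution of safe prompts in a semantic space and flag large deviations as out-of-distribution. The sketch below illustrates that idea only; it is not the paper's implementation. The synthetic vectors stand in for real prompt embeddings, and the Gaussian fit, Mahalanobis-distance score, and quantile threshold are illustrative assumptions standing in for whatever density model T3 actually uses.

```python
"""Minimal sketch: safety as OOD detection over a semantic space.

Assumptions (not from the paper): a Gaussian fit, a Mahalanobis-distance
score, and a quantile threshold stand in for T3's actual detector, and
synthetic vectors stand in for real prompt embeddings from an encoder.
"""
import numpy as np

rng = np.random.default_rng(0)
dim = 32

# In practice these would be encoder embeddings of known-safe prompts;
# here we simulate a tight "typical" cluster for a runnable demo.
safe_embeddings = rng.normal(loc=0.0, scale=1.0, size=(2000, dim))

# Fit the typical set: mean and covariance of the safe distribution.
mean = safe_embeddings.mean(axis=0)
cov = np.cov(safe_embeddings, rowvar=False)
prec = np.linalg.inv(cov + 1e-6 * np.eye(dim))  # regularized inverse covariance

def mahalanobis_sq(x):
    """Squared Mahalanobis distance of embedding(s) x from the safe mean."""
    d = x - mean
    return np.einsum("...j,jk,...k->...", d, prec, d)

# Calibrate the threshold on safe data only (here the 99th percentile),
# so no harmful examples are ever needed for training.
threshold = np.quantile(mahalanobis_sq(safe_embeddings), 0.99)

def flag_as_threat(embedding):
    """True if the prompt embedding falls outside the typical region."""
    return mahalanobis_sq(embedding) > threshold

# A prompt near the safe cluster passes; one far from it is flagged.
in_dist = rng.normal(size=dim)
out_dist = rng.normal(loc=4.0, size=dim)
print(flag_as_threat(in_dist), flag_as_threat(out_dist))  # expect: False True
```

Because the threshold is calibrated only on safe data, this toy detector mirrors the abstract's claim that no training on harmful examples is required; anything sufficiently atypical relative to the safe distribution is flagged regardless of whether it was ever enumerated as a threat.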