Trust The Typical
February 4, 2026
Authors: Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary
cs.AI
Abstract
Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
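The core idea, learning the distribution of acceptable prompts in a semantic space and flagging significant deviations, can be sketched as a simple density-based OOD detector. The sketch below is illustrative only and is not the paper's actual method: it stands in random vectors for prompt embeddings, fits a Gaussian to the "safe" set, and scores new points by Mahalanobis distance against a quantile threshold calibrated on safe data alone, so no harmful examples are needed at training time.

```python
import numpy as np

def fit_safe_distribution(safe_embeddings):
    """Estimate mean and inverse covariance of safe prompt embeddings."""
    mu = safe_embeddings.mean(axis=0)
    cov = np.cov(safe_embeddings, rowvar=False)
    # Small ridge term keeps the covariance invertible.
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    return mu, cov_inv

def ood_score(embedding, mu, cov_inv):
    """Mahalanobis distance to the safe distribution; larger = more atypical."""
    d = embedding - mu
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
# Stand-ins for semantic embeddings of safe prompts (8-dim toy space).
safe = rng.normal(0.0, 1.0, size=(500, 8))
mu, cov_inv = fit_safe_distribution(safe)

# Threshold set from safe data only: flag the most atypical 1%.
threshold = np.quantile([ood_score(e, mu, cov_inv) for e in safe], 0.99)

typical = rng.normal(0.0, 1.0, size=8)   # drawn from the safe distribution
atypical = rng.normal(6.0, 1.0, size=8)  # far outside the safe distribution

print("typical score:", ood_score(typical, mu, cov_inv))
print("atypical score:", ood_score(atypical, mu, cov_inv))
print("atypical flagged:", ood_score(atypical, mu, cov_inv) > threshold)
```

Because the threshold is a quantile of scores on held-in safe data, the false-positive rate is controlled directly, which mirrors the trade-off the paper emphasizes; the real system would replace the Gaussian with whatever density model T3 uses over learned semantic embeddings.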