Trust The Typical
February 4, 2026
Authors: Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary
cs.AI
Abstract
Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
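The core idea, learning the distribution of acceptable prompts in a semantic space and flagging significant deviations, can be sketched as a simple density-based OOD detector. The sketch below is illustrative only and is not the paper's actual method: it stands in random vectors for prompt embeddings, fits a Gaussian to the "safe" set, and scores new points by Mahalanobis distance against a quantile threshold calibrated on safe data alone, so no harmful examples are needed at training time.

```python
import numpy as np

def fit_safe_distribution(safe_embeddings):
    """Estimate mean and inverse covariance of safe prompt embeddings."""
    mu = safe_embeddings.mean(axis=0)
    cov = np.cov(safe_embeddings, rowvar=False)
    # Small ridge term keeps the covariance invertible.
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    return mu, cov_inv

def ood_score(embedding, mu, cov_inv):
    """Mahalanobis distance to the safe distribution; larger = more atypical."""
    d = embedding - mu
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(0)
# Stand-ins for semantic embeddings of safe prompts (8-dim toy space).
safe = rng.normal(0.0, 1.0, size=(500, 8))
mu, cov_inv = fit_safe_distribution(safe)

# Threshold set from safe data only: flag the most atypical 1%.
threshold = np.quantile([ood_score(e, mu, cov_inv) for e in safe], 0.99)

typical = rng.normal(0.0, 1.0, size=8)   # drawn from the safe distribution
atypical = rng.normal(6.0, 1.0, size=8)  # far outside the safe distribution

print("typical score:", ood_score(typical, mu, cov_inv))
print("atypical score:", ood_score(atypical, mu, cov_inv))
print("atypical flagged:", ood_score(atypical, mu, cov_inv) > threshold)
```

Because the threshold is a quantile of scores on held-in safe data, the false-positive rate is controlled directly, which mirrors the trade-off the paper emphasizes; the real system would replace the Gaussian with whatever density model T3 uses over learned semantic embeddings.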