Trust The Typical

Abstract

Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
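
To make the OOD framing concrete, the following is a minimal sketch of one-class detection fit only on safe examples. It is not the T3 implementation: the abstract does not name the encoder or the density estimator, so this sketch assumes a k-nearest-neighbour distance score as a stand-in and uses synthetic Gaussian vectors in place of real prompt embeddings. The names `knn_score`, `fit_threshold`, and `is_flagged` are hypothetical.

```python
import math
import random


def knn_score(x, reference, k=5):
    """Distance from x to its k-th nearest neighbour in the safe reference set.

    Larger scores mean x lies in a sparser region of the safe distribution.
    """
    dists = sorted(math.dist(x, r) for r in reference)
    return dists[k - 1]


def fit_threshold(reference, calibration, k=5, target_fpr=0.05):
    """Choose a threshold from held-out SAFE points only.

    Roughly a target_fpr fraction of the calibration points score above
    the returned value. No harmful examples are used anywhere.
    """
    scores = sorted(knn_score(c, reference, k) for c in calibration)
    idx = min(len(scores) - 1, math.ceil((1 - target_fpr) * len(scores)) - 1)
    return scores[idx]


def is_flagged(x, reference, threshold, k=5):
    """Flag x as a potential threat if it deviates too far from safe data."""
    return knn_score(x, reference, k) > threshold


# Assumption: synthetic 8-dimensional vectors stand in for prompt embeddings.
rng = random.Random(0)


def sample(center, n, scale=1.0):
    return [[rng.gauss(c, scale) for c in center] for _ in range(n)]


safe_center = [0.0] * 8
reference = sample(safe_center, 200)    # safe data used for scoring
calibration = sample(safe_center, 100)  # held-out safe data for the threshold
threshold = fit_threshold(reference, calibration)

typical = [0.0] * 8   # sits in the densest part of the safe region
atypical = [6.0] * 8  # far from every safe example

print(is_flagged(typical, reference, threshold))
print(is_flagged(atypical, reference, threshold))
```

The point of the sketch is the calibration step: the threshold is set from the score distribution of held-out safe data alone, so the false positive rate on typical inputs is chosen directly rather than tuned against a list of known attacks.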