LLM Safety From Within: Detecting Harmful Content with Internal Representations
April 20, 2026
Authors: Difan Jiao, Yilun Liu, Ye Yuan, Zhenwei Tang, Linfeng Du, Haolun Wu, Ashton Anderson
cs.AI
Abstract
Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
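To make the pipeline described above concrete, below is a minimal illustrative sketch, not the authors' released implementation: it fits a linear probe on pre-extracted hidden states from each layer, keeps the highest-weight neurons as stand-ins for "safety neurons", and combines per-layer scores with weights derived from validation accuracy as a simple stand-in for the paper's adaptive layer-weighting. All names and shapes (hidden_states, labels, top_k) are assumptions for illustration.

```python
# Illustrative sketch only (assumed interfaces, not SIREN's actual code).
# hidden_states: array of shape (num_layers, num_examples, hidden_dim),
# e.g. pooled last-token activations exported from an LLM; labels: (num_examples,).
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_layer_probes(hidden_states, labels, top_k=256):
    """Train one linear probe per layer, restricted to its top-|weight| neurons."""
    probes, neuron_idx = [], []
    for layer_acts in hidden_states:
        full = LogisticRegression(max_iter=1000).fit(layer_acts, labels)
        # Treat the neurons with the largest-magnitude probe weights as "safety neurons".
        idx = np.argsort(np.abs(full.coef_[0]))[-top_k:]
        slim = LogisticRegression(max_iter=1000).fit(layer_acts[:, idx], labels)
        probes.append(slim)
        neuron_idx.append(idx)
    return probes, neuron_idx


def fit_layer_weights(probes, neuron_idx, val_hidden_states, val_labels, temperature=10.0):
    """Weight each layer by its validation accuracy, normalized with a softmax
    (a simple surrogate for an adaptive layer-weighting strategy)."""
    accs = np.array([
        p.score(h[:, idx], val_labels)
        for p, idx, h in zip(probes, neuron_idx, val_hidden_states)
    ])
    w = np.exp(temperature * accs)
    return w / w.sum()


def harmfulness_score(probes, neuron_idx, weights, hidden_states):
    """Weighted average of per-layer probe probabilities; higher means more harmful."""
    per_layer = np.stack([
        p.predict_proba(h[:, idx])[:, 1]
        for p, idx, h in zip(probes, neuron_idx, hidden_states)
    ])
    return weights @ per_layer  # shape: (num_examples,)
```

Because the detector only reads hidden states, a score like this can in principle be recomputed after every generated token, which is what makes streaming detection and the low trainable-parameter count plausible relative to generative guard models.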