LLM Safety From Within: Detecting Harmful Content with Internal Representations

Abstract

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
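
The general idea of probing internal layers and weighting them can be illustrated with a toy example. The sketch below is not the SIREN implementation: it assumes synthetic "hidden states" in place of real LLM activations, uses a plain logistic-regression probe per layer, ranks neurons by the magnitude of their probe weight, and derives layer weights from a softmax over validation accuracy. All function names, the `SIGNAL` strengths, and the `temperature` parameter are hypothetical choices made for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logit(probe, x):
    # Linear probe: dot(weights, activations) + bias.
    w, b = probe
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_probe(xs, ys, epochs=100, lr=0.5):
    # Logistic-regression probe fitted with plain batch gradient descent.
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(logit((w, b), x)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(probe, xs, ys):
    # Fraction of samples whose probe logit has the correct sign.
    hits = sum((logit(probe, x) > 0) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

def top_neurons(probe, k):
    # Assumption: "important" neurons are those with the largest |probe weight|.
    w, _ = probe
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k]

def layer_weights(scores, temperature=0.1):
    # Assumption: layer weights are a softmax over per-layer validation accuracy.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_toy_states(n, signal, dim, rng):
    # Synthetic stand-in for hidden states: in layer l, neuron 0 is shifted
    # by +signal[l] for harmful samples (y=1) and -signal[l] for benign ones.
    data, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sample = []
        for strength in signal:
            vec = [rng.gauss(0, 1) for _ in range(dim)]
            vec[0] += strength if y == 1 else -strength
            sample.append(vec)
        data.append(sample)
        labels.append(y)
    return data, labels

def layer_slice(samples, l):
    # Activations of layer l for every sample.
    return [s[l] for s in samples]

rng = random.Random(0)
SIGNAL = [0.0, 1.0, 2.0]  # hypothetical per-layer signal strength
N_LAYERS, DIM = len(SIGNAL), 4
train_x, train_y = make_toy_states(200, SIGNAL, DIM, rng)
val_x, val_y = make_toy_states(100, SIGNAL, DIM, rng)
test_x, test_y = make_toy_states(100, SIGNAL, DIM, rng)

# One probe per layer; the base "model" (here, the toy data) is never modified.
probes = [train_probe(layer_slice(train_x, l), train_y) for l in range(N_LAYERS)]
val_acc = [accuracy(probes[l], layer_slice(val_x, l), val_y) for l in range(N_LAYERS)]
weights = layer_weights(val_acc)

def harm_score(sample):
    # Weighted sum of per-layer probe logits, squashed to a probability.
    return sigmoid(sum(wt * logit(p, h) for wt, p, h in zip(weights, probes, sample)))

test_acc = sum((harm_score(s) > 0.5) == (y == 1)
               for s, y in zip(test_x, test_y)) / len(test_y)
print("layer weights:", [round(w, 3) for w in weights])
print("top neuron per layer:", [top_neurons(p, 1) for p in probes])
print("test accuracy:", test_acc)
```

In this toy setup the layer with the strongest planted signal receives the largest weight, and only the small probes are trained. With a real model, the per-layer vectors would come from the LLM's hidden states rather than from `make_toy_states`.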

