LLM Safety From Within: Detecting Harmful Content with Internal Representations

Abstract

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
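
To make the idea of per-layer linear probing with layer weighting more concrete, the following is a minimal sketch on synthetic data, not SIREN's actual implementation. The abstract does not specify how safety neurons are selected or how layer weights are computed, so two choices here are assumptions made purely for illustration: neurons are picked by the magnitude of their probe weights, and layers are combined with a softmax over validation accuracy. All function names (`train_probe`, `fit_detector`, and so on) are hypothetical, and the random "hidden states" stand in for activations that would be read from a frozen LLM.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_probe(X, y, lr=0.3, steps=300, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b

def make_synthetic_layers(n, n_layers, d, rng):
    """Stand-in for real hidden states: deeper layers carry a stronger label signal."""
    y = rng.integers(0, 2, size=n).astype(float)
    layers = []
    for layer in range(n_layers):
        X = rng.normal(size=(n, d))
        # Only the first three "neurons" are informative; layer 0 is pure noise.
        X[:, :3] += 0.5 * layer * (2.0 * y[:, None] - 1.0)
        layers.append(X)
    return layers, y

def fit_detector(train_layers, y_train, val_layers, y_val, top_k=3, temperature=0.1):
    """Probe each layer, keep its top-k neurons, and weight layers by validation accuracy."""
    probes, val_acc = [], []
    for X_tr, X_va in zip(train_layers, val_layers):
        w_full, _ = train_probe(X_tr, y_train)
        idx = np.argsort(-np.abs(w_full))[:top_k]      # assumed selection rule
        w, b = train_probe(X_tr[:, idx], y_train)      # refit on the selected neurons only
        acc = np.mean((sigmoid(X_va[:, idx] @ w + b) > 0.5) == y_val)
        probes.append((idx, w, b))
        val_acc.append(acc)
    logits = np.array(val_acc) / temperature           # assumed weighting rule
    layer_w = np.exp(logits - logits.max())
    return probes, layer_w / layer_w.sum()

def predict(layers, probes, layer_w):
    """Weighted average of per-layer probe probabilities."""
    per_layer = np.stack([sigmoid(X[:, idx] @ w + b)
                          for X, (idx, w, b) in zip(layers, probes)])
    return layer_w @ per_layer

rng = np.random.default_rng(0)
layers, y = make_synthetic_layers(n=600, n_layers=4, d=16, rng=rng)
train = [X[:300] for X in layers]
val = [X[300:450] for X in layers]
test = [X[450:] for X in layers]

probes, layer_w = fit_detector(train, y[:300], val, y[300:450])
test_scores = predict(test, probes, layer_w)
accuracy = np.mean((test_scores > 0.5) == y[450:])
print("layer weights:", np.round(layer_w, 3))
print("test accuracy:", accuracy)
```

In this toy setup the base representations are never modified. Only the small probe vectors and the layer weights are fitted, which mirrors the abstract's claim of building a detector without changing the underlying model. With a real LLM, the per-layer matrices would be replaced by hidden states extracted at each layer for labeled harmful and benign inputs.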
