False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Abstract

Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has used probing-based approaches to study how separable malicious and benign inputs are in LLMs' internal representations, and researchers have proposed using such probes for safety detection. We systematically re-examine this paradigm. Motivated by the probes' poor out-of-distribution performance, we hypothesize that they learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation proceeds in three steps: showing that simple n-gram methods achieve comparable performance, running controlled experiments with semantically cleaned datasets, and analyzing pattern dependencies in detail. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols. We provide further discussion in the hope of encouraging responsible research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
