False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Abstract

Large Language Models (LLMs) can comply with harmful instructions, which raises serious safety concerns despite their impressive capabilities. Recent work has used probing-based approaches to study how separable malicious and benign inputs are in LLMs' internal representations, and researchers have proposed using such probes for safety detection. We systematically re-examine this paradigm. Motivated by the probes' poor out-of-distribution performance, we hypothesize that they learn superficial patterns rather than semantic harmfulness. Controlled experiments confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation proceeds in three steps: we first show that simple n-gram methods achieve comparable performance, then run controlled experiments on semantically cleaned datasets, and finally analyze the pattern dependencies in detail. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols. We discuss these issues further in the hope of guiding responsible research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
