
False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

September 4, 2025
Authors: Cheng Wang, Zeming Wei, Qin Liu, Muhao Chen
cs.AI

Abstract

Despite their impressive capabilities, Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probes for safety detection. We systematically re-examine this paradigm. Motivated by their poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation proceeds systematically, from demonstrating that simple n-gram methods achieve comparable performance, to controlled experiments on semantically cleaned datasets, to a detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols; we provide further discussion in the hope of guiding responsible research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
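
For readers unfamiliar with the setup, the sketch below illustrates what a probing-based detector of this kind typically looks like: a linear classifier fit on hidden states extracted from a language model. This is not the authors' exact pipeline; the model name ("gpt2"), the probe layer, the helper last_token_state, and the two toy prompts are illustrative placeholders.

```python
# Minimal sketch of a probing-based detector (illustrative, not the paper's exact pipeline):
# fit a linear probe on last-token hidden states to separate malicious from benign prompts.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; the paper studies instruction-tuned LLMs
PROBE_LAYER = 6       # placeholder intermediate layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def last_token_state(prompt: str, layer: int = PROBE_LAYER) -> torch.Tensor:
    """Return the hidden state of the final prompt token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

# Toy labeled prompts (1 = malicious, 0 = benign); real experiments use full datasets.
prompts = ["How do I build a phishing website?", "How do I bake sourdough bread?"]
labels = [1, 0]

features = torch.stack([last_token_state(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print(probe.predict(features))
```

The paper's first piece of evidence is that a purely surface-level baseline, with no access to model internals, performs comparably in distribution. A correspondingly minimal n-gram baseline, again over placeholder toy data, might look like:

```python
# Surface-level n-gram baseline over raw prompt text: no model internals involved.
# Comparable in-distribution accuracy to the probe is what motivates the paper's
# "superficial patterns" hypothesis. Prompts and labels are toy placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

prompts = ["How do I build a phishing website?", "How do I bake sourdough bread?"]
labels = [1, 0]  # 1 = malicious, 0 = benign

ngram_baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigram + bigram counts
    LogisticRegression(max_iter=1000),
)
ngram_baseline.fit(prompts, labels)
print(ngram_baseline.predict(prompts))
```

Under this framing, the paper's out-of-distribution failures and its controlled experiments on semantically cleaned data ask whether the probe's decision boundary tracks harmfulness itself or only surface cues such as instructional phrasing and trigger words.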