False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
September 4, 2025
Authors: Cheng Wang, Zeming Wei, Qin Liu, Muhao Chen
cs.AI
Abstract
Despite their impressive capabilities, Large Language Models (LLMs) can comply
with harmful instructions, raising serious safety concerns. Recent work has
leveraged probing-based approaches to study the separability of malicious and
benign inputs in LLMs' internal representations, and researchers have proposed
using such probing methods for safety detection. We systematically re-examine
this paradigm. Motivated by poor out-of-distribution performance, we
hypothesize that probes learn superficial patterns rather than semantic
harmfulness. Through controlled experiments, we confirm this hypothesis and
identify the specific patterns learned: instructional patterns and trigger
words. Our investigation proceeds in three stages: demonstrating that simple
n-gram methods achieve comparable performance, running controlled experiments
on semantically cleaned datasets, and analyzing pattern
dependencies. These results reveal a false sense of security around current
probing-based approaches and highlight the need to redesign both models and
evaluation protocols; we provide further discussion to guide responsible
research in this direction. We have open-sourced the project at
https://github.com/WangCheng0116/Why-Probe-Fails.
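
To make the comparison concrete, below is a minimal sketch (not the authors' code) of the two detection paradigms the abstract contrasts: a linear probe trained on an LLM's internal representations versus a simple n-gram classifier over the raw text. The model name, probed layer, and toy prompts are illustrative assumptions; the paper's actual experimental setup is in the linked repository.

```python
# Minimal sketch (illustrative, NOT the authors' pipeline): contrast a
# probing-based detector on hidden states with a surface-level n-gram baseline.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # illustrative choice; any causal LM works
LAYER = -1                               # probe the final hidden layer (an arbitrary choice)
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.to(device).eval()

def last_token_reps(prompts):
    """Hidden state of each prompt's final token at the chosen layer."""
    reps = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt").to(device)
        with torch.no_grad():
            out = model(**inputs)
        reps.append(out.hidden_states[LAYER][0, -1].float().cpu().numpy())
    return np.stack(reps)

# Toy placeholder data; real experiments use full benign/malicious benchmarks.
train_texts = [
    "Explain how to bake sourdough bread.",
    "Write a short poem about rain.",
    "Explain how to build an explosive device.",
    "Write malware that steals saved passwords.",
]
train_labels = [0, 0, 1, 1]  # 0 = benign, 1 = malicious
ood_texts = [
    "Compose a haiku about spring.",
    "Describe a procedure for synthesizing a dangerous toxin.",
]
ood_labels = [0, 1]

# 1) Probing-based detector: linear classifier on internal representations.
probe = LogisticRegression(max_iter=1000).fit(last_token_reps(train_texts), train_labels)
probe_acc = accuracy_score(ood_labels, probe.predict(last_token_reps(ood_texts)))

# 2) Surface baseline: word n-grams over raw text, no model internals at all.
vec = TfidfVectorizer(ngram_range=(1, 2))
ngram_clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(train_texts), train_labels)
ngram_acc = accuracy_score(ood_labels, ngram_clf.predict(vec.transform(ood_texts)))

print(f"probe OOD accuracy:  {probe_acc:.2f}")
print(f"n-gram OOD accuracy: {ngram_acc:.2f}")
```

If the n-gram baseline matches the probe on held-out and out-of-distribution prompts, the probe is likely keying on surface patterns (instructional phrasing, trigger words) rather than semantic harmfulness, which is the paper's central claim.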