False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
September 4, 2025
Authors: Cheng Wang, Zeming Wei, Qin Liu, Muhao Chen
cs.AI
Abstract
Despite their impressive capabilities, Large Language Models (LLMs) can comply
with harmful instructions, raising serious safety concerns. Recent work has
leveraged probing-based approaches to study the separability of malicious and
benign inputs in LLMs' internal representations, and researchers have proposed
using such probing methods for safety detection. We systematically re-examine
this paradigm. Motivated by poor out-of-distribution performance, we
hypothesize that probes learn superficial patterns rather than semantic
harmfulness. Through controlled experiments, we confirm this hypothesis and
identify the specific patterns learned: instructional patterns and trigger
words. Our investigation proceeds in three stages: demonstrating that simple
n-gram methods achieve comparable performance, running controlled experiments
on semantically cleaned datasets, and analyzing pattern
dependencies. These results reveal a false sense of security around current
probing-based approaches and highlight the need to redesign both models and
evaluation protocols; we provide further discussion to guide responsible
research in this direction. We have open-sourced the project at
https://github.com/WangCheng0116/Why-Probe-Fails.
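
To make the comparison concrete, below is a minimal sketch (not the authors' code) of the two detection paradigms the abstract contrasts: a linear probe trained on an LLM's internal representations versus a simple n-gram classifier over the raw text. The model name, probed layer, and toy prompts are illustrative assumptions; the paper's actual experimental setup is in the linked repository.

```python
# Minimal sketch (illustrative, NOT the authors' pipeline): contrast a
# probing-based detector on hidden states with a surface-level n-gram baseline.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # illustrative choice; any causal LM works
LAYER = -1                               # probe the final hidden layer (an arbitrary choice)
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.to(device).eval()

def last_token_reps(prompts):
    """Hidden state of each prompt's final token at the chosen layer."""
    reps = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt").to(device)
        with torch.no_grad():
            out = model(**inputs)
        reps.append(out.hidden_states[LAYER][0, -1].float().cpu().numpy())
    return np.stack(reps)

# Toy placeholder data; real experiments use full benign/malicious benchmarks.
train_texts = [
    "Explain how to bake sourdough bread.",
    "Write a short poem about rain.",
    "Explain how to build an explosive device.",
    "Write malware that steals saved passwords.",
]
train_labels = [0, 0, 1, 1]  # 0 = benign, 1 = malicious
ood_texts = [
    "Compose a haiku about spring.",
    "Describe a procedure for synthesizing a dangerous toxin.",
]
ood_labels = [0, 1]

# 1) Probing-based detector: linear classifier on internal representations.
probe = LogisticRegression(max_iter=1000).fit(last_token_reps(train_texts), train_labels)
probe_acc = accuracy_score(ood_labels, probe.predict(last_token_reps(ood_texts)))

# 2) Surface baseline: word n-grams over raw text, no model internals at all.
vec = TfidfVectorizer(ngram_range=(1, 2))
ngram_clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(train_texts), train_labels)
ngram_acc = accuracy_score(ood_labels, ngram_clf.predict(vec.transform(ood_texts)))

print(f"probe OOD accuracy:  {probe_acc:.2f}")
print(f"n-gram OOD accuracy: {ngram_acc:.2f}")
```

If the n-gram baseline matches the probe on held-out and out-of-distribution prompts, the probe is likely keying on surface patterns (instructional phrasing, trigger words) rather than semantic harmfulness, which is the paper's central claim.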