Sentinel:防範提示注入攻擊的頂尖模型
Sentinel: SOTA model to protect against prompt injections
June 5, 2025
作者: Dror Ivry, Oran Nahum
cs.AI
摘要
大型語言模型(LLMs)日益強大,但仍易受提示注入攻擊的影響,此類惡意輸入會導致模型偏離其預定指令。本文介紹了Sentinel,一種基於\answerdotai/ModernBERT-large架構的新型檢測模型qualifire/prompt-injection-sentinel。通過利用ModernBERT的先進特性,並在包含多個開源和私有數據集的廣泛且多樣化的數據集上進行微調,Sentinel實現了最先進的性能。該數據集融合了多種攻擊類型,從角色扮演和指令劫持到生成偏見內容的嘗試,以及廣泛的良性指令,其中私有數據集特別針對細微的錯誤修正和現實世界中的誤分類。在一個全面且未見的內部測試集上,Sentinel展示了平均準確率為0.987和F1分數為0.980的表現。此外,在公共基準測試中,它始終優於protectai/deberta-v3-base-prompt-injection-v2等強基線。本文詳細介紹了Sentinel的架構、其精細的數據集策劃、訓練方法以及全面的評估,突出了其卓越的檢測能力。
English
Large Language Models (LLMs) are increasingly powerful but remain vulnerable
to prompt injection attacks, where malicious inputs cause the model to deviate
from its intended instructions. This paper introduces Sentinel, a novel
detection model, qualifire/prompt-injection-sentinel, based on the
\answerdotai/ModernBERT-large architecture. By leveraging ModernBERT's advanced
features and fine-tuning on an extensive and diverse dataset comprising a few
open-source and private collections, Sentinel achieves state-of-the-art
performance. This dataset amalgamates varied attack types, from role-playing
and instruction hijacking to attempts to generate biased content, alongside a
broad spectrum of benign instructions, with private datasets specifically
targeting nuanced error correction and real-world misclassifications. On a
comprehensive, unseen internal test set, Sentinel demonstrates an average
accuracy of 0.987 and an F1-score of 0.980. Furthermore, when evaluated on
public benchmarks, it consistently outperforms strong baselines like
protectai/deberta-v3-base-prompt-injection-v2. This work details Sentinel's
architecture, its meticulous dataset curation, its training methodology, and a
thorough evaluation, highlighting its superior detection capabilities.