
Sentinel: SOTA model to protect against prompt injections

June 5, 2025
Authors: Dror Ivry, Oran Nahum
cs.AI

Abstract

Large Language Models (LLMs) are increasingly powerful but remain vulnerable to prompt injection attacks, in which malicious inputs cause the model to deviate from its intended instructions. This paper introduces Sentinel, a novel detection model, qualifire/prompt-injection-sentinel, based on the answerdotai/ModernBERT-large architecture. By leveraging ModernBERT's advanced features and fine-tuning on an extensive and diverse dataset comprising several open-source and private collections, Sentinel achieves state-of-the-art performance. This dataset amalgamates varied attack types, from role-playing and instruction hijacking to attempts to generate biased content, alongside a broad spectrum of benign instructions, with the private datasets specifically targeting nuanced error correction and real-world misclassifications. On a comprehensive, unseen internal test set, Sentinel achieves an average accuracy of 0.987 and an F1-score of 0.980. Furthermore, when evaluated on public benchmarks, it consistently outperforms strong baselines such as protectai/deberta-v3-base-prompt-injection-v2. This work details Sentinel's architecture, its meticulous dataset curation, its training methodology, and a thorough evaluation, highlighting its superior detection capabilities.
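For readers unfamiliar with the reported metrics: prompt-injection detection is a binary classification task, so accuracy and F1-score follow directly from confusion-matrix counts. The sketch below illustrates the computation with hypothetical counts; these are not the paper's actual test-set numbers.

```python
# Illustrative only: accuracy and F1-score for a binary
# injection-vs-benign classifier, computed from confusion-matrix
# counts. The counts below are hypothetical, not Sentinel's.

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all examples classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; equals 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical counts for a 1000-example test set
tp, fp, fn, tn = 480, 8, 12, 500

print(f"accuracy = {accuracy(tp, fp, fn, tn):.3f}")  # accuracy = 0.980
print(f"F1       = {f1_score(tp, fp, fn):.3f}")      # F1       = 0.980
```

Note that accuracy and F1 can diverge sharply on imbalanced test sets, which is why the paper reports both.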

