Sentinel: SOTA model to protect against prompt injections

Abstract

Large Language Models (LLMs) are increasingly powerful but remain vulnerable to prompt injection attacks, in which malicious inputs cause the model to deviate from its intended instructions. This paper introduces Sentinel, a novel detection model published as qualifire/prompt-injection-sentinel and based on the answerdotai/ModernBERT-large architecture. By leveraging ModernBERT's advanced features and fine-tuning on an extensive and diverse dataset comprising a few open-source and private collections, Sentinel achieves state-of-the-art performance. The dataset combines varied attack types, from role-playing and instruction hijacking to attempts to generate biased content, with a broad spectrum of benign instructions; the private datasets specifically target nuanced error correction and real-world misclassifications. On a comprehensive, unseen internal test set, Sentinel achieves an average accuracy of 0.987 and an F1-score of 0.980. When evaluated on public benchmarks, it consistently outperforms strong baselines such as protectai/deberta-v3-base-prompt-injection-v2. This work details Sentinel's architecture, its meticulous dataset curation, its training methodology, and a thorough evaluation that highlights its superior detection capabilities.
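For readers less familiar with the two metrics reported above, the following is a minimal sketch of how accuracy and F1-score are conventionally computed for a binary detector. It assumes label 1 means "injection" and label 0 means "benign"; the function name `binary_metrics` and the toy labels are hypothetical and are not taken from the paper or its evaluation code.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for a binary detector (1 = injection)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = len(y_true)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, made up for illustration only (not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Because F1 ignores true negatives, it can diverge from accuracy when benign prompts heavily outnumber injections, which is one reason for reporting both figures.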
