Sentinel: SOTA model to protect against prompt injections

Abstract

Large Language Models (LLMs) are increasingly powerful but remain vulnerable to prompt injection attacks, in which malicious inputs cause the model to deviate from its intended instructions. This paper introduces Sentinel (qualifire/prompt-injection-sentinel), a novel detection model based on the answerdotai/ModernBERT-large architecture. By leveraging ModernBERT's advanced features and fine-tuning on an extensive and diverse dataset comprising a few open-source and private collections, Sentinel achieves state-of-the-art performance. The dataset combines varied attack types, from role-playing and instruction hijacking to attempts to generate biased content, with a broad spectrum of benign instructions; the private datasets specifically target nuanced error correction and real-world misclassifications. On a comprehensive, unseen internal test set, Sentinel achieves an average accuracy of 0.987 and an F1-score of 0.980. When evaluated on public benchmarks, it consistently outperforms strong baselines such as protectai/deberta-v3-base-prompt-injection-v2. This work details Sentinel's architecture, its dataset curation, its training methodology, and a thorough evaluation of its detection capabilities.
