VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data

Abstract

Vision-language models (VLMs) are essential for contextual understanding of both visual and textual information. However, their vulnerability to adversarially manipulated inputs presents significant risks, leading to compromised outputs and raising concerns about the reliability of VLM-integrated applications. Detecting these malicious prompts is thus crucial for maintaining trust in VLM generations. A major challenge in developing a safeguarding prompt classifier is the lack of a large amount of labeled benign and malicious data. To address this issue, we introduce VLMGuard, a novel learning framework that leverages unlabeled user prompts in the wild for malicious prompt detection. These unlabeled prompts, which naturally arise when VLMs are deployed in the open world, contain both benign and malicious information. To harness the unlabeled data, we present an automated maliciousness estimation score for distinguishing between benign and malicious samples within this unlabeled mixture, thereby enabling the training of a binary prompt classifier on top. Notably, our framework does not require extra human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that VLMGuard achieves superior detection results, significantly outperforming state-of-the-art methods. Disclaimer: This paper may contain offensive examples; reader discretion is advised.
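
The two-stage idea described above (score the unlabeled mixture, then train a binary classifier on the resulting pseudo-labels) can be sketched roughly as follows. The abstract does not say how the maliciousness score is computed, so everything here is an illustrative assumption rather than the authors' method: the score is taken to be the squared projection of each centered prompt embedding onto the top singular directions of the unlabeled set, the threshold is user-chosen, and the classifier is a plain logistic regression. The function names `maliciousness_scores`, `pseudo_label`, `train_logistic`, and `predict` are hypothetical.

```python
import numpy as np


def maliciousness_scores(embeddings: np.ndarray, k: int = 1) -> np.ndarray:
    """Assumed score: squared norm of each centered embedding projected
    onto the top-k right singular vectors of the unlabeled set."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projections = centered @ vt[:k].T
    return (projections ** 2).sum(axis=1)


def pseudo_label(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Mark samples scoring above the threshold as malicious (1), else benign (0)."""
    return (scores > threshold).astype(int)


def train_logistic(x: np.ndarray, y: np.ndarray, lr: float = 0.5, steps: int = 500):
    """Fit a logistic-regression classifier with batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y  # derivative of the mean log loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict(x: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Return 1 where the predicted probability of 'malicious' exceeds 0.5."""
    return ((x @ w + b) > 0).astype(int)
```

In use, one would compute scores over embeddings of the unlabeled prompts, pick a threshold (for example a high quantile of the scores, which assumes malicious prompts are the minority), and train the classifier on the pseudo-labels. No human annotation enters the loop, which matches the framework's stated goal.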
