OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities

Abstract

The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigating these concerns is detecting harmful queries to the model. Current detection approaches are fallible and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting and by 20.44% for image-based prompts, and it sets a new state of the art for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (approximately 120 times faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
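The two steps named in the abstract, (i) finding a layer whose representations are aligned across languages and (ii) training a classifier on that layer, can be illustrated with a toy sketch. This is not the authors' implementation: synthetic vectors stand in for LLM hidden states, the alignment score (mean cosine similarity of matched pairs minus that of mismatched pairs) is an assumption chosen for illustration, and every function name below is hypothetical. It uses only the Python standard library.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_score(emb_a, emb_b):
    """Assumed score: matched-pair cosine minus mismatched-pair cosine."""
    n = len(emb_a)
    matched = sum(cosine(emb_a[i], emb_b[i]) for i in range(n)) / n
    mismatched = sum(
        cosine(emb_a[i], emb_b[j]) for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    return matched - mismatched


def pick_aligned_layer(layers_a, layers_b):
    """Step (i): return the index of the layer with the highest score."""
    scores = [alignment_score(a, b) for a, b in zip(layers_a, layers_b)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def train_logistic(xs, ys, lr=0.5, steps=300):
    """Step (ii): plain gradient-descent logistic regression."""
    d = len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for k in range(d):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def predict(x, w, b):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


def run_demo(seed=0, n=200, d=8):
    rng = random.Random(seed)
    # Shared "meaning" vectors; the toy harmful/benign label depends on them.
    meaning = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    labels = [int(m[0] > 0) for m in meaning]

    def view(signal, noise):
        # One language's embeddings: scaled meaning plus language-specific noise.
        return [[signal * mi + rng.gauss(0, noise) for mi in m] for m in meaning]

    # (signal, noise) per synthetic layer; only the middle layer is well aligned.
    mix = [(0.1, 1.0), (1.0, 0.1), (0.3, 1.0)]
    layers_a = [view(s, e) for s, e in mix]  # stand-in for language A
    layers_b = [view(s, e) for s, e in mix]  # stand-in for language B

    best, _ = pick_aligned_layer(layers_a, layers_b)
    half = n // 2
    # Train on language A (first half), evaluate on language B (held-out half).
    w, b = train_logistic(layers_a[best][:half], labels[:half])
    test = zip(layers_b[best][half:], labels[half:])
    accuracy = sum(predict(x, w, b) == y for x, y in test) / (n - half)
    return best, accuracy


best_layer, accuracy = run_demo()
print(f"selected layer: {best_layer}, cross-language accuracy: {accuracy:.2f}")
```

In this toy setup the classifier never sees language B during training. It transfers only because the selected layer encodes both languages in a shared space, which is the intuition behind building a language-agnostic detector on aligned representations.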
