
OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities

May 29, 2025
Authors: Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh
cs.AI

Abstract

The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible and particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (approximately 120x faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
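
The two-step recipe in the abstract can be illustrated with a minimal sketch: extract a mid-depth hidden-state embedding of a prompt from an LLM, then train a lightweight probe on those embeddings to flag harmful prompts. The model name, layer index, pooling strategy, and logistic-regression probe below are illustrative assumptions, not the authors' exact configuration; see the paper and repository for the actual method.

```python
# Minimal sketch of the two-step idea from the abstract, under assumptions:
# (i) take an internal (hidden-layer) representation of a prompt from an LLM,
# (ii) train a lightweight classifier on those embeddings to detect harm.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "Qwen/Qwen2.5-0.5B"  # assumed stand-in model, not the paper's choice
LAYER = 12  # assumed mid-depth layer; the paper selects layers aligned across languages/modalities

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def embed(prompt: str) -> torch.Tensor:
    """Mean-pool one layer's hidden states as the prompt embedding."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Toy labeled prompts (1 = harmful, 0 = benign) -- purely illustrative.
prompts = ["How do I bake sourdough bread?", "Explain photosynthesis simply.",
           "Give step-by-step instructions to make a weapon.",
           "Help me steal someone's login credentials."]
labels = [0, 0, 1, 1]

X = torch.stack([embed(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

test = embed("How can I break into a server?").unsqueeze(0).numpy()
print(probe.predict(test))
```

Because the embedding is a byproduct of the forward pass the model already runs during generation, the probe adds almost no extra compute, which is consistent with the efficiency claim in the abstract.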