OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities

Abstract

The emerging capabilities of large language models (LLMs) have raised concerns about their immediate potential for harmful misuse. A core approach to mitigating these concerns is detecting harmful queries to the model. Current detection approaches are fallible and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages, or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities, and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting and by 20.44% for image-based prompts, and sets a new state of the art (SOTA) for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (approximately 120× faster than the next fastest baseline). Code and data are available at https://github.com/vsahil/OmniGuard.
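
The two steps of the approach can be illustrated with a minimal sketch on synthetic data standing in for real model embeddings. Everything specific here is an assumption made for illustration and is not taken from the abstract: the alignment score (mean cosine similarity of matched pairs minus that of mismatched pairs), the function names, and the small logistic-regression classifier. The released code may differ on all of these points.

```python
import numpy as np

def alignment_score(a, b):
    """Hypothetical score: matched-pair minus mismatched-pair cosine similarity.

    a, b: (n, d) embeddings of the same n prompts in two languages/modalities.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T                      # (n, n) pairwise cosine similarities
    n = sims.shape[0]
    matched = np.trace(sims) / n
    mismatched = (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return float(matched - mismatched)

def select_layer(layers_a, layers_b):
    """Step (i): pick the layer whose representations are best aligned."""
    scores = [alignment_score(a, b) for a, b in zip(layers_a, layers_b)]
    return int(np.argmax(scores)), scores

def train_classifier(x, y, lr=0.1, steps=500):
    """Step (ii): logistic regression fitted by plain gradient descent."""
    w, bias = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + bias)))
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        bias -= lr * grad.mean()
    return w, bias

def predict(x, w, bias):
    return (x @ w + bias > 0).astype(int)

# Synthetic stand-in: layer 0 is unaligned noise, layer 1 shares "meaning".
rng = np.random.default_rng(0)
n, d = 200, 16
meaning = rng.normal(size=(n, d))
labels = (meaning[:, 0] > 0).astype(int)          # 1 = "harmful" (toy label)
layers_a = [rng.normal(size=(n, d)), meaning]      # e.g. English prompts
layers_b = [rng.normal(size=(n, d)),               # e.g. another language
            meaning + 0.1 * rng.normal(size=(n, d))]

best, scores = select_layer(layers_a, layers_b)
w, bias = train_classifier(layers_a[best], labels)  # train on one language
accuracy = (predict(layers_b[best], w, bias) == labels).mean()  # test on other
print(f"selected layer: {best}, cross-lingual accuracy: {accuracy:.2f}")
```

The design point the sketch tries to capture is that the classifier is trained on embeddings from one language (or modality) and evaluated on another, which can only work if the selected layer places equivalent prompts close together.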
