A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

Abstract

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study of how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and show that a coordinated multi-modal attack poses a significantly more potent threat than single-modality attacks (attack success rate = 83.43% vs. 34.93%). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be made publicly available.
