A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

Abstract

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that a coordinated multi-modal attack poses a significantly more potent threat than single-modality attacks (attack success rate = 83.43% vs. 34.93%). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be made publicly available.
