A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

Abstract

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and show that a coordinated multi-modal attack poses a significantly more potent threat than single-modality attacks (attack success rate = 83.43% vs. 34.93%). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.
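
The abstract reports an attack success rate (ASR) without defining it. The sketch below illustrates one common convention, which is an assumption here and not confirmed by the source: ASR is the percentage of examples the model answers correctly without the attack that it gets wrong once the attack is applied. The function name `attack_success_rate` and the toy data are hypothetical and are not the paper's code or results.

```python
from typing import Sequence


def attack_success_rate(
    clean_correct: Sequence[bool], attacked_correct: Sequence[bool]
) -> float:
    """Percentage of originally correct examples that become wrong under attack.

    Assumed definition for illustration; the paper may define ASR differently.
    """
    if len(clean_correct) != len(attacked_correct):
        raise ValueError("inputs must have the same length")
    # Only examples answered correctly without the attack can be flipped by it.
    eligible = [a for c, a in zip(clean_correct, attacked_correct) if c]
    if not eligible:
        return 0.0
    flipped = sum(1 for a in eligible if not a)
    return 100.0 * flipped / len(eligible)


# Toy correctness flags, made up for illustration only (not results from the paper).
clean = [True, True, True, True, False]
single_modality = [True, True, True, False, False]
coordinated = [False, False, False, True, False]

asr_single = attack_success_rate(clean, single_modality)
asr_coordinated = attack_success_rate(clean, coordinated)
print(f"single-modality ASR: {asr_single:.2f}%")
print(f"coordinated ASR:     {asr_coordinated:.2f}%")
```

Under this convention, examples the model already answers incorrectly are excluded from the denominator, so the metric reflects only errors induced by the attack.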

