跨模态排版攻击对音视频推理影响的系统性研究

摘要

随着音视频多模态大语言模型（MLLMs）在安全关键应用中的日益普及，理解其脆弱性变得至关重要。为此，我们提出多模态排版攻击研究，系统性地探究跨模态的排版攻击如何对MLLMs产生负面影响。与先前仅关注单模态攻击的研究不同，本文揭示了MLLMs的跨模态脆弱性。通过分析音频、视觉和文本扰动之间的相互作用，我们发现协同多模态攻击能产生比单模态攻击更显著的威胁（攻击成功率83.43% vs 34.93%）。我们在多种前沿MLLMs、任务类型以及常识推理与内容审核基准测试中的实验结果表明，多模态排版攻击已成为多模态推理中至关重要却尚未被充分探索的攻击策略。相关代码与数据将公开共享。

English

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study examining how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and reveal that coordinated multi-modal attack creates a significantly more potent threat than single-modality attacks (attack success rate = 83.43% vs 34.93%).Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establishes multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.

跨模态排版攻击对音视频推理影响的系统性研究

A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

摘要

Support