A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

Abstract

As audio-visual multi-modal large language models (MLLMs) are increasingly deployed in safety-critical applications, understanding their vulnerabilities is crucial. To this end, we introduce Multi-Modal Typography, a systematic study of how typographic attacks across multiple modalities adversely influence MLLMs. While prior work focuses narrowly on unimodal attacks, we expose the cross-modal fragility of MLLMs. We analyze the interactions between audio, visual, and text perturbations and show that a coordinated multi-modal attack poses a significantly more potent threat than single-modality attacks (attack success rate of 83.43% vs. 34.93%). Our findings across multiple frontier MLLMs, tasks, and common-sense reasoning and content moderation benchmarks establish multi-modal typography as a critical and underexplored attack strategy in multi-modal reasoning. Code and data will be publicly available.
