MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

Abstract

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy into a single browser-based system. A dual-metric framework distinguishes hard Attack Success Rate (ASR), which counts only Compliance, from soft ASR, which also counts Partial Compliance, and so captures partial information leakage that binary metrics miss. To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation. Experiments across six multimodal LLMs from four providers show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal. ITMS does not uniformly raise final ASR on already-saturated baselines, but it accelerates convergence by destabilizing early-turn defenses. Ablation further reveals that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.
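The distinction between hard and soft ASR can be illustrated with a minimal sketch. It assumes each attack run receives exactly one judge label and that each rate is a simple fraction of runs; the abstract names only the Compliance and Partial Compliance levels, so the function name `attack_success_rates` and the placeholder label `"Refusal"` are hypothetical and not taken from MUSE.

```python
from collections import Counter

# Only these two labels are named in the abstract. Any other label string
# (such as the placeholder "Refusal" below) counts as a failed attack here.
COMPLIANCE = "Compliance"
PARTIAL_COMPLIANCE = "Partial Compliance"


def attack_success_rates(verdicts):
    """Return (hard_asr, soft_asr) for a list of judge labels, one per run."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    total = len(verdicts)
    # Hard ASR: only full Compliance counts as a successful attack.
    hard = counts[COMPLIANCE] / total
    # Soft ASR: Partial Compliance is also counted as a success.
    soft = (counts[COMPLIANCE] + counts[PARTIAL_COMPLIANCE]) / total
    return hard, soft


# Hypothetical verdicts for four runs, for illustration only.
verdicts = ["Compliance", "Partial Compliance", "Refusal", "Refusal"]
hard_asr, soft_asr = attack_success_rates(verdicts)
print(f"hard ASR = {hard_asr:.2f}, soft ASR = {soft_asr:.2f}")
```

Under these assumptions, the gap between the two rates is the share of runs in which the model leaked partial information without fully complying, which a single binary metric would either hide or overstate.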
