MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

Abstract

Safety evaluation and red-teaming of large language models remain predominantly text-centric, and existing frameworks lack the infrastructure to systematically test whether alignment generalizes to audio, image, and video inputs. We present MUSE (Multimodal Unified Safety Evaluation), an open-source, run-centric platform that integrates automatic cross-modal payload generation, three multi-turn attack algorithms (Crescendo, PAIR, Violent Durian), provider-agnostic model routing, and an LLM judge with a five-level safety taxonomy into a single browser-based system. A dual-metric framework distinguishes hard Attack Success Rate (ASR; Compliance only) from soft ASR (including Partial Compliance), capturing partial information leakage that binary metrics miss. To probe whether alignment generalizes across modality boundaries, we introduce Inter-Turn Modality Switching (ITMS), which augments multi-turn attacks with per-turn modality rotation. Experiments across six multimodal LLMs from four providers show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal. ITMS does not uniformly raise final ASR on already-saturated baselines, but it accelerates convergence by destabilizing early-turn defenses. Ablation reveals that the direction of modality effects is model-family-specific rather than universal, underscoring the need for provider-aware cross-modal safety testing.
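The distinction between hard and soft ASR can be illustrated with a minimal sketch. This is not the platform's implementation: the abstract names only the Compliance and Partial Compliance levels, so the remaining three labels of the five-level taxonomy, the `Verdict` enum, and the `attack_success_rates` function are hypothetical names chosen for illustration.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    # Only COMPLIANCE and PARTIAL_COMPLIANCE are named in the abstract.
    # The other three labels are hypothetical placeholders for the
    # remaining levels of the five-level taxonomy.
    COMPLIANCE = "compliance"
    PARTIAL_COMPLIANCE = "partial_compliance"
    INDIRECT_REFUSAL = "indirect_refusal"  # hypothetical
    DIRECT_REFUSAL = "direct_refusal"      # hypothetical
    OTHER = "other"                        # hypothetical


def attack_success_rates(verdicts):
    """Return (hard_asr, soft_asr) as fractions over a list of judge verdicts."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    hard = counts[Verdict.COMPLIANCE]                 # full compliance only
    soft = hard + counts[Verdict.PARTIAL_COMPLIANCE]  # adds partial compliance
    total = len(verdicts)
    return hard / total, soft / total


# Four judged runs: two full compliances, one partial, one refusal.
runs = [
    Verdict.COMPLIANCE,
    Verdict.PARTIAL_COMPLIANCE,
    Verdict.DIRECT_REFUSAL,
    Verdict.COMPLIANCE,
]
hard_asr, soft_asr = attack_success_rates(runs)
print(hard_asr, soft_asr)  # 0.5 0.75
```

In this toy example the gap between the two rates (0.5 versus 0.75) is the partial leakage that a single binary success metric would either discard or conflate with full compliance.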
