OmniSafeBench-MM:多模態越獄攻防評估的統一基準與工具箱
OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation
December 6, 2025
作者: Xiaojun Jia, Jie Liao, Qi Guo, Teng Ma, Simeng Qin, Ranjie Duan, Tianlin Li, Yihao Huang, Zhitao Zeng, Dongxian Wu, Yiming Li, Wenqi Ren, Xiaochun Cao, Yang Liu
cs.AI
摘要
近年來,多模態大型語言模型(MLLMs)的快速發展實現了統一的感知-推理能力,然而這些系統仍極易受到越獄攻擊的影響,導致安全防護機制被繞過並誘發有害行為。現有基準如JailBreakV-28K、MM-SafetyBench和HADES雖為多模態漏洞研究提供了重要參考,但普遍存在攻擊場景覆蓋有限、缺乏標準化防禦評估框架,且未提供統一可重現的工具箱等問題。為此,我們推出OmniSafeBench-MM——一個面向多模態越獄攻防評估的綜合工具箱。該平台整合了13種代表性攻擊方法、15種防禦策略,以及涵蓋9大風險領域與50個細分類別的多元化數據集,並通過諮詢式、指令式與陳述式三類查詢結構模擬真實用戶意圖。除數據覆蓋度外,我們建立三維評估協議,量化衡量:(1)危害程度,採用從低影響個體危害到災難性社會威脅的多級細粒度標度;(2)回應與查詢的意圖對齊度;(3)回應詳盡度,從而實現安全-效用的細緻平衡分析。我們對10個開源與8個閉源MLLMs進行大規模實驗,揭示其對多模態越獄的脆弱性。通過將數據、方法與評估整合為開源可重現平台,OmniSafeBench-MM為未來研究建立了標準化基礎。程式碼已發佈於:https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM。
English
Recent advances in multi-modal large language models (MLLMs) have enabled unified perception-reasoning capabilities, yet these systems remain highly vulnerable to jailbreak attacks that bypass safety alignment and induce harmful behaviors. Existing benchmarks such as JailBreakV-28K, MM-SafetyBench, and HADES provide valuable insights into multi-modal vulnerabilities, but they typically focus on limited attack scenarios, lack standardized defense evaluation, and offer no unified, reproducible toolbox. To address these gaps, we introduce OmniSafeBench-MM, which is a comprehensive toolbox for multi-modal jailbreak attack-defense evaluation. OmniSafeBench-MM integrates 13 representative attack methods, 15 defense strategies, and a diverse dataset spanning 9 major risk domains and 50 fine-grained categories, structured across consultative, imperative, and declarative inquiry types to reflect realistic user intentions. Beyond data coverage, it establishes a three-dimensional evaluation protocol measuring (1) harmfulness, distinguished by a granular, multi-level scale ranging from low-impact individual harm to catastrophic societal threats, (2) intent alignment between responses and queries, and (3) response detail level, enabling nuanced safety-utility analysis. We conduct extensive experiments on 10 open-source and 8 closed-source MLLMs to reveal their vulnerability to multi-modal jailbreak. By unifying data, methodology, and evaluation into an open-source, reproducible platform, OmniSafeBench-MM provides a standardized foundation for future research. The code is released at https://github.com/jiaxiaojunQAQ/OmniSafeBench-MM.