arXiv:2511.07315v1
JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework
November 10, 2025
Authors: Yuxuan Zhou, Yang Bai, Kuofeng Gao, Tao Dai, Shu-Tao Xia
cs.CR
Abstract
The widespread application of large vision-language models (VLMs) makes ensuring their secure deployment critical. While recent studies have demonstrated jailbreak attacks on VLMs, existing approaches are limited: they either require white-box access, which restricts practicality, or rely on manually crafted patterns, which leads to poor sample diversity and scalability. To address these gaps, we propose JPRO, a novel multi-agent collaborative framework designed for automated VLM jailbreaking that effectively overcomes the shortcomings of prior methods in attack diversity and scalability. Through the coordinated action of four specialized agents and its two core modules, Tactic-Driven Seed Generation and the Adaptive Optimization Loop, JPRO generates effective and diverse attack samples. Experimental results show that JPRO achieves over a 60% attack success rate on multiple advanced VLMs, including GPT-4o, significantly outperforming existing methods. As a black-box attack approach, JPRO not only uncovers critical security vulnerabilities in multimodal models but also offers valuable insights for evaluating and enhancing VLM robustness.