JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Abstract

The widespread application of large vision-language models (VLMs) makes their secure deployment critical. Recent studies have demonstrated jailbreak attacks on VLMs, but existing approaches are limited: they either require white-box access, which restricts their practicality, or rely on manually crafted patterns, which leads to poor sample diversity and scalability. To address these gaps, we propose JPRO, a novel multi-agent collaborative framework for automated VLM jailbreaking that overcomes the shortcomings of prior methods in attack diversity and scalability. Through the coordinated action of four specialized agents and two core modules, Tactic-Driven Seed Generation and Adaptive Optimization Loop, JPRO generates effective and diverse attack samples. Experimental results show that JPRO achieves an attack success rate of over 60% on multiple advanced VLMs, including GPT-4o, significantly outperforming existing methods. As a black-box attack approach, JPRO not only uncovers critical security vulnerabilities in multimodal models but also offers valuable insights for evaluating and enhancing VLM robustness.
