JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Abstract

The widespread application of large vision-language models (VLMs) makes ensuring their secure deployment critical. Recent studies have demonstrated jailbreak attacks on VLMs, but existing approaches are limited: they either require white-box access, which restricts their practicality, or rely on manually crafted patterns, which leads to poor sample diversity and scalability. To address these gaps, we propose JPRO, a novel multi-agent collaborative framework for automated VLM jailbreaking that overcomes the shortcomings of prior methods in attack diversity and scalability. Through the coordinated action of four specialized agents and two core modules, Tactic-Driven Seed Generation and Adaptive Optimization Loop, JPRO generates effective and diverse attack samples. Experimental results show that JPRO achieves an attack success rate above 60% on multiple advanced VLMs, including GPT-4o, significantly outperforming existing methods. As a black-box attack approach, JPRO not only uncovers critical security vulnerabilities in multimodal models but also offers valuable insights for evaluating and enhancing VLM robustness.
