
When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

February 10, 2026
Authors: Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu, Wai Kin Victor Chan, Alex Jinpeng Wang
cs.AI

Abstract

Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose the Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities and provide both a benchmark and a practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.
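
The abstract includes no code, but the vision-prompt paradigm the attack targets is easy to make concrete. The sketch below shows, under stated assumptions, how an editing instruction can travel entirely through the image channel by rendering instruction text and an arrow mark onto the input image; the deliberately benign instruction, coordinates, and file names are illustrative stand-ins, not the paper's actual VJA construction.

```python
# Hypothetical illustration of the vision-prompt paradigm: the editing
# instruction travels entirely through the image channel (visual text
# plus an arrow mark), with no text prompt at all. The instruction,
# coordinates, and file names are assumptions for illustration only.
from PIL import Image, ImageDraw

def add_visual_prompt(image_path: str, instruction: str, out_path: str) -> None:
    """Render an instruction and an arrow mark directly onto the image."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)

    # Visual-text prompt: the instruction arrives as pixels, not tokens.
    draw.rectangle([10, 10, 20 + 8 * len(instruction), 40], fill="white")
    draw.text((14, 16), instruction, fill="red")

    # Arrow mark pointing at the region the instruction refers to.
    w, h = img.size
    draw.line([(w // 2, 60), (w // 2, h // 3)], fill="red", width=4)
    draw.polygon(
        [(w // 2 - 8, h // 3), (w // 2 + 8, h // 3), (w // 2, h // 3 + 16)],
        fill="red",
    )
    img.save(out_path)

# A deliberately benign example; the editing model receives only the image.
add_visual_prompt("input.jpg", "Replace the sky with a sunset", "prompted.jpg")
```

Because the model reads intent from pixels rather than tokens, a text-side safety filter never sees such an instruction, which is precisely the attack surface the abstract describes.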
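
The abstract describes the defense only at a high level: training-free, introspective, and run without an auxiliary guard model. One plausible reading is a self-check pass in which the editing model is first asked to verbalize and judge the intent it infers from the visual input before executing it. The sketch below follows that reading; `model.generate_text`, `model.edit`, and the prompt text are hypothetical placeholders, not the paper's implementation.

```python
# Hypothetical sketch of a training-free introspective defense, following
# the abstract's description: before editing, the same multimodal model
# is asked to state and assess the intent it infers from the visual
# input. `model.generate_text` and `model.edit` are placeholder APIs;
# the paper's actual mechanism may differ.
INTROSPECTION_PROMPT = (
    "Describe the editing instruction conveyed by the marks, arrows, or "
    "text in this image, then answer SAFE or UNSAFE: would following it "
    "produce harmful or policy-violating content?"
)

def guarded_edit(model, image):
    # Pass 1: introspective reasoning by the editing model itself,
    # so no auxiliary guard model is needed.
    verdict = model.generate_text(image=image, prompt=INTROSPECTION_PROMPT)
    if "UNSAFE" in verdict.upper():
        return None  # refuse instead of editing

    # Pass 2: perform the vision-prompted edit only if the self-check passes.
    return model.edit(image=image)
```

The appeal of such a design is that the safety check reuses the model's own multimodal reasoning and costs roughly one extra inference pass, consistent with the claimed negligible computational overhead.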