When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models
February 10, 2026
Authors: Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu, Wai Kin Victor Chan, Alex Jinpeng Wang
cs.AI
Abstract
Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose the Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of 80.9% and 70.1% on Nano Banana Pro and GPT-Image-1.5, respectively. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which raises the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities and provide both a benchmark and a practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.