How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing
February 2, 2026
Authors: Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Ruichuan An, Yifan Zhang, Anna Korhonen, Zhang Zhang, Liang Wang, Tieniu Tan
cs.AI
Abstract
Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.
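To make the LMM-as-a-judge idea concrete, below is a minimal, hypothetical sketch of how rubric-based judging over the three interaction levels might be wired up. All names here (the level identifiers, the `EditCase` fields, the 1-5 rubric, and the reply-parsing logic) are illustrative assumptions, not the benchmark's actual implementation; a real pipeline would pass the source image, visual instruction, and edited result to a multimodal judge model.

```python
"""Hypothetical sketch of a VIBE-style LMM-as-a-judge scoring loop.

All identifiers and the scoring rubric are assumptions for illustration,
not the paper's actual evaluation framework.
"""
from dataclasses import dataclass

# The three interaction levels described in the abstract.
LEVELS = ("deictic_grounding", "morphological_manipulation", "causal_reasoning")


@dataclass
class EditCase:
    level: str        # one of LEVELS
    instruction: str  # textual gloss of the visual instruction
    # A real benchmark case would also carry the source image,
    # the visual instruction (e.g. a sketch), and the edited result.


def build_judge_prompt(case: EditCase) -> str:
    """Compose a rubric prompt asking the judge model for a 1-5 score."""
    assert case.level in LEVELS, f"unknown level: {case.level}"
    return (
        f"Task level: {case.level}\n"
        f"Instruction: {case.instruction}\n"
        "Rate how faithfully the edited image follows the visual "
        "instruction on a 1-5 scale. Answer as 'Score: <n>'."
    )


def parse_score(judge_reply: str) -> int:
    """Extract the integer score from the judge's reply, clamped to 1-5."""
    for token in judge_reply.replace(":", " ").split():
        if token.isdigit():
            return min(max(int(token), 1), 5)
    raise ValueError("no score found in judge reply")


case = EditCase("deictic_grounding", "recolor the circled object red")
prompt = build_judge_prompt(case)
score = parse_score("Score: 4")  # stand-in for a real judge-model call
print(score)  # -> 4
```

The task-specific metrics mentioned in the abstract could then be attached per level, with scores aggregated separately for deictic grounding, morphological manipulation, and causal reasoning.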