
How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

February 2, 2026
Authors: Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Ruichuan An, Yifan Zhang, Anna Korhonen, Zhang Zhang, Liang Wang, Tieniu Tan
cs.AI

Abstract

Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.
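The abstract mentions an LMM-as-a-judge evaluation framework with task-specific metrics. As an illustration only, the minimal Python sketch below shows what a single judge call in such a framework could look like, assuming an OpenAI-compatible multimodal API. The judge model ("gpt-4o"), the prompt wording, the 1-5 scale, and the judge_edit helper are all hypothetical assumptions for exposition, not the paper's actual protocol.

# A minimal sketch of one LMM-as-a-judge scoring call, assuming an
# OpenAI-compatible multimodal API. The judge model, prompt, and 1-5
# scale are illustrative assumptions, not VIBE's actual configuration.
import base64
import json

from openai import OpenAI

client = OpenAI()


def encode_image(path: str) -> str:
    """Return a data URL for a local image so it can be sent to the judge."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:image/png;base64,{b64}"


def judge_edit(source_path: str, instruction_path: str, edited_path: str,
               criterion: str) -> int:
    """Rate one edited image against a single task-specific criterion."""
    prompt = (
        "You are grading a visual-instruction-driven image edit. "
        f"Criterion: {criterion}. "
        "Given the source image, the visual instruction (e.g. a sketch), and "
        "the edited result, reply with JSON of the form "
        '{"score": <integer 1-5>, "reason": <string>}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge; the paper's judge model may differ
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": encode_image(source_path)}},
                {"type": "image_url",
                 "image_url": {"url": encode_image(instruction_path)}},
                {"type": "image_url",
                 "image_url": {"url": encode_image(edited_path)}},
            ],
        }],
    )
    # A production harness would validate the JSON and retry on parse errors.
    return json.loads(resp.choices[0].message.content)["score"]

In a framework like the one described, such a call would be issued once per test case and per criterion (e.g. instruction adherence for deictic grounding versus consistency for causal reasoning), with scores aggregated across the benchmark's three levels.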