DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

Abstract

Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models show plausible instruction adherence and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and detail refinement in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the ability of IIEMs to edit small-scale objects. Specifically, we construct a challenging testbed comprising 1,889 samples across seven instruction types. In these samples, target objects occupy only 1% to 10% of the image area, and the samples cover complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined scoring rubrics that minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. The protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) to address the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
