DLEBench: Evaluating the Small-Scale Object Editing Ability of Instruction-based Image Editing Models

Abstract

Instruction-based Image Editing Models (IIEMs) have advanced significantly. However, while these models show plausible instruction adherence and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and detail refinement in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing how well IIEMs edit small-scale objects. Specifically, we construct a challenging testbed of 1,889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, and the scenarios include complex cases such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics that minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. The protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) to address the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
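
To make the 1%-10% size criterion concrete, the following minimal sketch checks whether a target object counts as small-scale. It assumes the object's area is approximated by an axis-aligned bounding box; the abstract does not say whether DLEBench measures area by box or by segmentation mask, and the names `area_fraction` and `is_small_scale` are hypothetical, not part of the benchmark.

```python
def area_fraction(box, image_size):
    """Return the fraction of the image covered by a bounding box.

    box: (x0, y0, x1, y1) in pixels, with x1 > x0 and y1 > y0 (assumed format).
    image_size: (width, height) in pixels.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    # Clamp at zero so a degenerate box yields zero area rather than a negative one.
    box_area = max(0, x1 - x0) * max(0, y1 - y0)
    return box_area / (width * height)


def is_small_scale(box, image_size, low=0.01, high=0.10):
    """True if the box covers between `low` and `high` of the image area.

    The default bounds mirror the 1%-10% range stated in the abstract;
    treating both bounds as inclusive is an assumption.
    """
    return low <= area_fraction(box, image_size) <= high


if __name__ == "__main__":
    # A 50x20 box in a 200x100 image covers 1000 of 20000 pixels.
    print(area_fraction((20, 10, 70, 30), (200, 100)))
    print(is_small_scale((20, 10, 70, 30), (200, 100)))
```

A mask-based variant would replace the box area with a count of foreground pixels, which gives a tighter estimate for irregular or partially occluded objects.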