GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

Abstract

Editing images using natural language instructions has become a natural and expressive way to modify visual content, yet evaluating the performance of such models remains challenging. Existing evaluation approaches often rely on image-text similarity metrics such as CLIP, which lack precision. In this work, we introduce a new benchmark designed to evaluate text-guided image editing models in a more grounded manner, along two critical dimensions: (i) functional correctness, assessed via automatically generated multiple-choice questions that verify whether the intended change was successfully applied; and (ii) image content preservation, which uses an object-aware masking technique and preservation scoring to ensure that non-targeted regions of the image remain visually consistent. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories, each annotated with detailed editing instructions, evaluation questions, and spatial object masks. We conduct a large-scale study comparing GPT-Image-1, the latest flagship in the text-guided image editing space, against several state-of-the-art editing models, and validate our automatic metrics against human ratings. Results show that GPT-Image-1 leads in instruction-following accuracy but often over-modifies irrelevant image regions, highlighting a key trade-off in current model behavior. GIE-Bench provides a scalable, reproducible framework for advancing more accurate evaluation of text-guided image editing.
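The two evaluation dimensions described above can be illustrated with a minimal sketch. The abstract does not say which preservation metric or question-answering model the benchmark uses, so everything here is an assumption for illustration: the function names `functional_correctness` and `preservation_score` are hypothetical, images are assumed to be float arrays in [0, 1], the object mask is assumed to mark the region the edit is allowed to change, and masked mean squared error stands in for whatever preservation score the benchmark actually computes.

```python
import numpy as np


def functional_correctness(predicted, expected):
    """Fraction of multiple-choice questions answered with the expected option.

    `predicted` holds the answers chosen by some judge model (not modeled here);
    `expected` holds the answers that indicate a successful edit.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if len(expected) == 0:
        raise ValueError("at least one question is required")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def preservation_score(original, edited, edit_mask):
    """Similarity of the two images outside the edited object, in (0, 1].

    Assumption: masked MSE mapped through 1 / (1 + mse) is used as a stand-in
    metric. `edit_mask` is an HxW boolean array, True where the edit is allowed
    to change pixels; only the remaining pixels are compared.
    """
    original = np.asarray(original, dtype=np.float64)
    edited = np.asarray(edited, dtype=np.float64)
    keep = ~np.asarray(edit_mask, dtype=bool)
    if original.shape != edited.shape:
        raise ValueError("images must have the same shape")
    if keep.shape != original.shape[:2]:
        raise ValueError("mask must match the image height and width")
    if not keep.any():
        raise ValueError("mask covers the whole image; nothing to compare")
    # Boolean indexing selects the non-target pixels (all channels of each).
    diff = (original - edited)[keep]
    mse = float(np.mean(diff ** 2))
    return 1.0 / (1.0 + mse)
```

Under these assumptions, an edit that changes only pixels inside the mask scores 1.0 on preservation, while any change that leaks into non-target regions lowers the score. Reporting the two numbers separately is what exposes the trade-off the abstract mentions between following the instruction and leaving the rest of the image alone.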