UniREditBench: A Unified Reasoning-based Image Editing Benchmark

November 3, 2025
Authors: Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, Jiaqi Wang
cs.AI

Abstract

Recent advances in multimodal generative models have driven substantial improvements in image editing. However, current generative models still struggle with diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark that systematically assesses their performance across reasoning scenarios. Existing benchmarks focus primarily on single-object attribute transformation in realistic scenarios and, while effective, face two key limitations: (1) they largely overlook multi-object interactions and game-world scenarios governed by human-defined rules, both of which are common in real-life applications; (2) they rely solely on textual references to evaluate generated images, which can lead to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples covering both real-world and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, which provides both a textual reference and a ground-truth image reference for assessing each sample. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel on this dataset to obtain UniREdit-Bagel, which shows substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.