UniREditBench: A Unified Reasoning-based Image Editing Benchmark
November 3, 2025
Authors: Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, Jiaqi Wang
cs.AI
Abstract
Recent advances in multi-modal generative models have driven substantial
improvements in image editing. However, current generative models still
struggle with handling diverse and complex image editing tasks that require
implicit reasoning, underscoring the need for a comprehensive benchmark to
systematically assess their performance across various reasoning scenarios.
Existing benchmarks primarily focus on single-object attribute transformations
in realistic scenarios; while practical, they suffer from two key limitations:
(1) they largely overlook multi-object interactions as well as game-world
scenarios that involve human-defined rules, which are common in real-life
applications; (2) they rely solely on textual references to evaluate the
generated images, which can lead to systematic misjudgments, especially in
complex reasoning scenarios. To this end, this work proposes UniREditBench, a
unified benchmark for reasoning-based image editing evaluation. It comprises
2,700 meticulously curated samples, covering both real- and game-world
scenarios across 8 primary dimensions and 18 sub-dimensions. To improve
evaluation reliability, we introduce multimodal dual-reference evaluation,
providing both textual and ground-truth image references for each sample
assessment. Furthermore, we design an automated multi-scenario data synthesis
pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with
high-quality chain-of-thought (CoT) reasoning annotations. We fine-tune Bagel
on this dataset and develop UniREdit-Bagel, demonstrating substantial
improvements in both in-domain and out-of-distribution settings. Through
comprehensive benchmarking of both open-source and closed-source image editing
models, we reveal their strengths and weaknesses across various aspects.
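The multimodal dual-reference evaluation described above pairs each sample with both a textual reference and a ground-truth image reference. The Python sketch below shows one way such a scorer could combine the two signals; the toy metrics (token overlap as a stand-in for a VLM judge, pixel MSE as a stand-in for a perceptual metric) and the equal weights are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch of a dual-reference scorer. Metrics and weights are
# assumptions for exposition, not UniREditBench's actual evaluation.
from dataclasses import dataclass
from PIL import Image
import numpy as np

@dataclass
class Sample:
    instruction: str     # editing instruction requiring implicit reasoning
    text_reference: str  # textual description of the expected outcome
    gt_image: str        # path to the ground-truth reference image

def text_reference_score(caption: str, text_reference: str) -> float:
    """Toy stand-in for a VLM judge: token overlap between a caption of
    the edited result and the textual reference, in [0, 1]."""
    a = set(caption.lower().split())
    b = set(text_reference.lower().split())
    return len(a & b) / max(len(b), 1)

def image_reference_score(edited_path: str, gt_path: str) -> float:
    """Toy stand-in for a perceptual metric: 1 minus normalized pixel MSE
    against the ground-truth reference image."""
    size = (256, 256)
    x = np.asarray(Image.open(edited_path).convert("RGB").resize(size), dtype=float)
    y = np.asarray(Image.open(gt_path).convert("RGB").resize(size), dtype=float)
    mse = ((x - y) ** 2).mean() / 255.0 ** 2
    return 1.0 - mse

def dual_reference_score(edited_path: str, caption: str, sample: Sample,
                         w_text: float = 0.5, w_img: float = 0.5) -> float:
    """Combine both reference checks; equal weighting is an assumption."""
    return (w_text * text_reference_score(caption, sample.text_reference)
            + w_img * image_reference_score(edited_path, sample.gt_image))
```

In this framing, the image reference catches edits that satisfy the text description but diverge from the intended outcome, which is exactly the systematic-misjudgment failure mode the abstract attributes to text-only evaluation.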
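Likewise, UniREdit-Data-100K pairs each editing sample with a chain-of-thought annotation. A possible record layout is sketched below; all field names are hypothetical, since the abstract does not specify the dataset schema.

```python
# Hypothetical record layout for a reasoning-annotated editing sample.
# Field names are illustrative; the abstract does not give the schema.
from dataclasses import dataclass, field

@dataclass
class UniREditRecord:
    source_image: str    # path to the input image
    instruction: str     # editing request requiring implicit reasoning
    scenario: str        # "real-world" or "game-world"
    dimension: str       # one of the 8 primary dimensions
    sub_dimension: str   # one of the 18 sub-dimensions
    cot: list[str] = field(default_factory=list)  # chain-of-thought steps
    target_image: str = ""                        # ground-truth edited image
```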