
GEditBench v2: A Human-Aligned Benchmark for General Image Editing

March 30, 2026
Authors: Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang Wu, Wei Cheng, Gang Yu, Xu Yang, Bihan Wen
cs.AI

Abstract

Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure, and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark of 1,200 real-world user queries spanning 23 editing tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond the predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. In addition, we construct VCReward-Bench from expert-annotated preference pairs to assess how well PVC-Judge aligns with human judgments of visual consistency. Experiments show that PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, reveals critical limitations of current models, and provides a reliable foundation for advancing precise image editing.
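
To make the evaluation protocol concrete, below is a minimal sketch of how agreement between a pairwise judge and expert preference labels could be computed, in the spirit of VCReward-Bench. All names here (`PreferencePair`, `judge_fn`, `agreement_rate`) are illustrative assumptions for exposition, not the paper's actual interface.

```python
# Minimal sketch of a VCReward-Bench-style agreement check.
# `PreferencePair`, `judge_fn`, and `agreement_rate` are illustrative
# assumptions, not the paper's actual API.
from dataclasses import dataclass
from typing import Callable, Literal

@dataclass
class PreferencePair:
    source_image: str                # path to the original (pre-edit) image
    edit_a: str                      # path to the first edited candidate
    edit_b: str                      # path to the second edited candidate
    human_choice: Literal["a", "b"]  # expert label: which edit better preserves visual consistency

def agreement_rate(
    pairs: list[PreferencePair],
    judge_fn: Callable[[str, str, str], Literal["a", "b"]],
) -> float:
    """Fraction of pairs where the judge's pick matches the expert label."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        judge_fn(p.source_image, p.edit_a, p.edit_b) == p.human_choice
        for p in pairs
    )
    return correct / len(pairs)
```

Under a metric of this kind, a judge that picks randomly scores about 0.5, so agreement meaningfully above that level indicates genuine alignment with expert preferences; the abstract's comparison of PVC-Judge against GPT-5.1 presumably scores each model as the judge on the same expert-labeled pairs.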