

GEditBench v2: A Human-Aligned Benchmark for General Image Editing

March 30, 2026
Authors: Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang Wu, Wei Cheng, Gang Yu, Xu Yang, Bihan Wen
cs.AI

Abstract

Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure, and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. In addition, we construct VCReward-Bench from expert-annotated preference pairs to assess how well PVC-Judge aligns with human judgments of visual consistency. Experiments show that our PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, reveals critical limitations of current models, and provides a reliable foundation for advancing precise image editing.
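
The abstract describes PVC-Judge as a pairwise assessment model whose agreement with expert preference pairs is measured on VCReward-Bench. The sketch below shows one plausible way such pairwise agreement could be computed. The `PreferencePair` schema and the `judge` callable are illustrative assumptions, not the paper's actual interface or protocol.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    """One expert-annotated comparison: two candidate edits of the same source image.

    Hypothetical schema for illustration; the paper's data format is not given here.
    """
    source_path: str   # original (pre-edit) image
    edit_a_path: str   # first candidate edit
    edit_b_path: str   # second candidate edit
    human_choice: str  # "a" or "b": which edit better preserves visual consistency


def pairwise_agreement(
    pairs: List[PreferencePair],
    judge: Callable[[str, str, str], str],
) -> float:
    """Fraction of pairs where the judge's pick matches the expert annotation.

    `judge(source, edit_a, edit_b)` is assumed to return "a" or "b",
    standing in for a pairwise consistency judge such as PVC-Judge.
    """
    correct = sum(
        judge(p.source_path, p.edit_a_path, p.edit_b_path) == p.human_choice
        for p in pairs
    )
    return correct / len(pairs)
```

Under this reading, a higher agreement score on expert-annotated pairs indicates closer alignment with human judgments of visual consistency, which is the comparison the abstract draws between PVC-Judge and other judges such as GPT-5.1.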