PaintBench: Deterministic Evaluation of Precise Visual Editing

May 29, 2026
Authors: Kai Xu, Ellis Brown, Shrikar Madhu, Rob Fergus, He He, Saining Xie
cs.AI

Abstract

While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores (R^2 = 0.91, p < 0.001). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
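
To make the idea of deterministic pixel-level scoring concrete, here is a minimal sketch in Python with NumPy. It is not the paper's implementation: the abstract does not define how mIoU is computed, so the definition below (pixels that needed to change or were changed form the union; those where the prediction matches the target form the intersection) is an assumption, and the names `edit_iou`, `mean_iou`, and `make_scene` are hypothetical. The toy scene is a red square on a black canvas, and the edit is "move the square two pixels to the right".

```python
import numpy as np

def edit_iou(source, predicted, target):
    """Assumed IoU for one edit: exact pixel comparison, no judge model.

    union        = pixels the edit should change, or that the model changed
    intersection = union pixels where the prediction equals the target
    """
    changed_target = np.any(source != target, axis=-1)
    changed_pred = np.any(source != predicted, axis=-1)
    union = changed_target | changed_pred
    if not union.any():
        # Nothing needed to change and nothing was changed.
        return 1.0
    correct = np.all(predicted == target, axis=-1)
    return float((union & correct).sum() / union.sum())

def mean_iou(samples):
    """Average edit_iou over (source, predicted, target) triples."""
    return float(np.mean([edit_iou(s, p, t) for s, p, t in samples]))

def make_scene(col):
    """Hypothetical procedural scene: a 2x2 red square on a 4x4 black canvas."""
    img = np.zeros((4, 4, 3), dtype=np.uint8)
    img[0:2, col:col + 2] = (255, 0, 0)
    return img

source = make_scene(0)      # square at the left edge
target = make_scene(2)      # ground truth: square moved two pixels right
perfect = target.copy()     # model output that matches exactly
unchanged = source.copy()   # model output that ignores the instruction
halfway = make_scene(1)     # model output that moves the square one pixel

print(edit_iou(source, perfect, target))
print(edit_iou(source, unchanged, target))
print(edit_iou(source, halfway, target))
print(mean_iou([(source, perfect, target),
                (source, unchanged, target),
                (source, halfway, target)]))
```

Because both the scene and the ground truth are generated by code, the score depends only on array comparisons, so repeated runs give identical results. Under this assumed definition, an edit that lands near the right place but not exactly on it still loses most of its credit, which is one way a precise-edit metric can be much harsher than a perceptual judgment.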