PaintBench: Deterministic Evaluation of Precise Visual Editing

Abstract

While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find low overall performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test whether PaintBench scores generalize to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find a strong linear correlation with PaintBench scores (R^2 = 0.91, p < 0.001). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
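To make the idea of deterministic pixel-level scoring concrete, the following is a minimal sketch of one way an IoU-style score could be computed without a judge model. It is an illustration under stated assumptions, not the metric definition used by PaintBench: the abstract does not specify how mIoU is computed, and the function names (`changed_mask`, `edit_iou`, `mean_iou`), the exact-match rule, and the empty-edit convention are all hypothetical choices made here.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]   # RGB triple
Image = List[List[Pixel]]      # row-major grid of pixels


def changed_mask(source: Image, edited: Image) -> List[List[bool]]:
    """Mark every pixel whose color differs between source and edited image."""
    return [
        [s != e for s, e in zip(src_row, edit_row)]
        for src_row, edit_row in zip(source, edited)
    ]


def edit_iou(source: Image, predicted: Image, target: Image) -> float:
    """IoU between the predicted and the ground-truth edit regions.

    Assumption (hypothetical): a pixel counts toward the intersection only
    if both the model and the ground truth changed it AND the model's
    color equals the target color exactly.
    """
    pred_mask = changed_mask(source, predicted)
    gold_mask = changed_mask(source, target)
    intersection = 0
    union = 0
    for p_row, g_row, out_row, tgt_row in zip(pred_mask, gold_mask, predicted, target):
        for p, g, out_px, tgt_px in zip(p_row, g_row, out_row, tgt_row):
            if p or g:
                union += 1
                if p and g and out_px == tgt_px:
                    intersection += 1
    # Assumption: if neither image changes anything, treat the edit as perfect.
    return 1.0 if union == 0 else intersection / union


def mean_iou(samples: Sequence[Tuple[Image, Image, Image]]) -> float:
    """Average edit_iou over (source, predicted, target) triples."""
    scores = [edit_iou(src, pred, tgt) for src, pred, tgt in samples]
    return sum(scores) / len(scores)
```

Because the score depends only on pixel comparisons against a procedurally generated target, repeated runs on the same inputs always give the same value, which is the property the abstract contrasts with judge-model evaluation.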