PaintBench: Deterministic Evaluation of Precise Visual Editing
May 29, 2026
Authors: Kai Xu, Ellis Brown, Shrikar Madhu, Rob Fergus, He He, Saining Xie
cs.AI
Abstract
While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores (R^2 = 0.91, p < 0.001). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
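To illustrate the kind of deterministic pixel-level scoring the abstract describes, the following is a minimal sketch of a mean intersection-over-union (mIoU) computation over binary masks. The abstract does not say how PaintBench derives its masks or aggregates scores, so the function names `iou` and `mean_iou`, the nested-list mask representation, and the convention for two empty masks are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Iterable, Sequence, Tuple

# A mask is a 2D grid of truthy/falsy values (hypothetical representation).
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two same-shaped binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            inter += int(p and t)  # pixel set in both masks
            union += int(p or t)   # pixel set in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, target) mask pairs."""
    scores = [iou(pred, target) for pred, target in pairs]
    if not scores:
        raise ValueError("mean_iou needs at least one mask pair")
    return sum(scores) / len(scores)
```

Because the score depends only on pixel values, running it twice on the same pair of images always gives the same result, which is the property that removes the need for a judge model.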