From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering
March 20, 2026
Authors: Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Jing-Hao Xue, Hao Li, Salman Khan, Zhiqiang Shen
cs.AI
Abstract
Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering detection from coarse region labels into a pixel-grounded, meaning- and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and the semantic class of the tampered object, linking low-level visual changes to high-level semantic understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision, evaluating detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness: detection is assessed via localization confidence or prediction of the true edit intensity, and tamper-meaning understanding is measured via semantics-aware classification and natural language descriptions of the predicted regions. We also re-evaluate existing segmentation/localization baselines alongside recent strong tamper detectors, revealing substantial over- and under-scoring under mask-only metrics and exposing failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meaning, and language, establishing a rigorous standard for tamper localization, semantic classification, and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
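The mask-misalignment claim can be illustrated with a minimal sketch (not the paper's actual metric implementation): a detector that perfectly reproduces the object mask scores a flawless mask IoU, yet most of those pixels were never edited, so a metric grounded in a per-pixel tamper map scores it much lower. The 0.1 intensity threshold and the toy layout below are illustrative assumptions.

```python
import numpy as np

def mask_iou(pred, gt):
    """Binary IoU between a predicted region and a ground-truth mask."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def pixel_f1(pred, tamper_map, thresh=0.1):
    """F1 against a per-pixel tamper map, thresholded on edit intensity."""
    gt = tamper_map > thresh
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

# Toy example: the object mask covers a 10x10 box, but only a thin
# 2-pixel strip inside it was actually edited (e.g. an attribute change).
H, W = 32, 32
object_mask = np.zeros((H, W), bool)
object_mask[8:18, 8:18] = True           # coarse region label
tamper_map = np.zeros((H, W), np.float32)
tamper_map[8:10, 8:18] = 0.8             # true per-pixel edit intensity

pred = object_mask.copy()                # detector that just predicts the mask
print(f"mask IoU: {mask_iou(pred, object_mask):.2f}")  # 1.00 -- looks perfect
print(f"pixel F1: {pixel_f1(pred, tamper_map):.2f}")   # 0.33 -- most mask pixels untouched
```

The gap between the two numbers is exactly the over-scoring the abstract describes; an edit outside the mask would produce the symmetric under-scoring case.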