From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Abstract

Existing tampering detection benchmarks largely rely on object masks, which are severely misaligned with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering detection, moving from coarse region labels to a pixel-grounded, meaning- and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and the semantic class of the tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness together with localization, assessing confidence or predictions against the true edit intensity, and that further measure understanding of tamper meaning via semantics-aware classification and natural language descriptions of the predicted regions. We also re-evaluate existing strong segmentation/localization baselines on recent strong tamper detectors, revealing substantial over- and under-scoring under mask-only metrics and exposing failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings, and language descriptions, establishing a rigorous standard for tamper localization, semantic classification, and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
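The mismatch between object masks and the true edit signal can be illustrated with a small sketch. This is not the benchmark's metric or the authors' code; it is a toy illustration that assumes the original and edited images are both available, defines edit intensity as the maximum absolute per-channel difference, and uses a hypothetical threshold `tau` to decide whether a pixel counts as changed. The function name `mask_vs_pixel_stats` is likewise an assumption made for this example.

```python
import numpy as np


def mask_vs_pixel_stats(original, edited, object_mask, tau=0.05):
    """Compare a coarse object mask with a per-pixel change map.

    original, edited: float arrays in [0, 1] with shape (H, W, C).
    object_mask: boolean array with shape (H, W).
    tau: hypothetical threshold on edit intensity (an assumption).
    """
    # Per-pixel edit intensity: largest absolute change across channels.
    intensity = np.abs(edited - original).max(axis=-1)
    changed = intensity > tau

    mask_px = object_mask.sum()
    changed_px = changed.sum()

    # Share of mask pixels that were not actually edited.
    untouched_in_mask = (
        (object_mask & ~changed).sum() / mask_px if mask_px else 0.0
    )
    # Share of edited pixels that fall outside the mask.
    changed_off_mask = (
        (changed & ~object_mask).sum() / changed_px if changed_px else 0.0
    )
    return intensity, untouched_in_mask, changed_off_mask


def toy_example():
    """Synthetic 8x8 image: half of the mask is edited, plus one off-mask pixel."""
    original = np.zeros((8, 8, 3))
    edited = original.copy()

    mask = np.zeros((8, 8), dtype=bool)
    mask[2:6, 2:6] = True      # 16 mask pixels

    edited[2:4, 2:6] = 0.5     # 8 edited pixels inside the mask
    edited[0, 0] = 0.5         # 1 edited pixel outside the mask
    return mask_vs_pixel_stats(original, edited, mask)
```

In the toy case, half of the mask pixels are untouched and one of the nine changed pixels lies outside the mask, which is the kind of discrepancy a mask-only label would hide. A thresholded difference is only one possible way to derive a change map; the benchmark's own tamper maps may be constructed differently.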
