LEGION: Learning to Ground and Explain for Synthetic Image Detection

Abstract

Rapid advances in generative technology are a double-edged sword: they offer powerful tools that enhance convenience, but they also raise significant social concerns. On the defensive side, current synthetic image detection methods often lack artifact-level textual interpretability and focus too heavily on image manipulation detection, while current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building on this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks; on SynthScars in particular, it surpasses the second-best traditional expert by 3.31% in mIoU and 7.75% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. The code, model, and dataset will be released.
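
The abstract reports segmentation quality in mIoU and F1. As a rough illustration of what these two scores measure on pixel-level artifact masks, the sketch below computes them from binary masks in plain Python. It is not the paper's evaluation code: the function names are hypothetical, mIoU is assumed here to be the per-image IoU averaged over images (the abstract does not state the averaging convention), and scoring two empty masks as 1.0 is likewise an assumption.

```python
from typing import List, Tuple

# Binary mask as nested lists: 1 = artifact pixel, 0 = clean pixel.
Mask = List[List[int]]


def confusion_counts(pred: Mask, target: Mask) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives per pixel."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target)
    union = tp + fp + fn
    # Assumption: two empty masks count as a perfect match.
    return tp / union if union else 1.0


def f1(pred: Mask, target: Mask) -> float:
    """Pixel-level F1 score: 2 * TP / (2 * TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    # Assumption: two empty masks count as a perfect match.
    return 2 * tp / denom if denom else 1.0


def mean_iou(preds: List[Mask], targets: List[Mask]) -> float:
    """Average the per-image IoU over a non-empty list of mask pairs."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Toy example: one pixel overlaps, one is a false alarm, one is missed.
pred = [[1, 1, 0],
        [0, 0, 0]]
target = [[1, 0, 0],
          [1, 0, 0]]
print(iou(pred, target), f1(pred, target))
```

In the toy example TP = FP = FN = 1, so IoU is 1/3 and F1 is 1/2. For a single mask pair the two are related by F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU.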
