Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Abstract

The rapid advancement of image generation technologies intensifies the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes and provide no human-understandable justification. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned visual-textual grounded reasoning. We then fine-tune MLLMs through a multi-stage optimization strategy that progressively balances the objectives of accurate detection, visual localization, and coherent textual explanation. The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws, significantly outperforming baseline methods.
