Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs
June 8, 2025
Authors: Yikun Ji, Hong Yan, Jun Lan, Huijia Zhu, Weiqiang Wang, Qi Fan, Liqing Zhang, Jianfu Zhang
cs.AI
Abstract
The rapid advancement of image generation technologies intensifies the demand
for interpretable and robust detection methods. Although existing approaches
often attain high accuracy, they typically operate as black boxes without
providing human-understandable justifications. Multi-modal Large Language
Models (MLLMs), while not originally intended for forgery detection, exhibit
strong analytical and reasoning capabilities. When properly fine-tuned, they
can effectively identify AI-generated images and offer meaningful explanations.
However, existing MLLMs still struggle with hallucination and often fail to
align their visual interpretations with actual image content and human
reasoning. To bridge this gap, we construct a dataset of AI-generated images
annotated with bounding boxes and descriptive captions that highlight synthesis
artifacts, establishing a foundation for human-aligned visual-textual grounded
reasoning. We then fine-tune MLLMs through a multi-stage optimization strategy
that progressively balances the objectives of accurate detection, visual
localization, and coherent textual explanation. The resulting model achieves
superior performance in both detecting AI-generated images and localizing
visual flaws, significantly outperforming baseline methods.
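The abstract outlines, but does not specify, the annotation schema or the staging schedule. Purely as illustration, the Python sketch below shows one plausible shape for a dataset record (bounding boxes paired with artifact captions) and a staged loss-weighting schedule that progressively shifts emphasis across the three objectives. Every name and weight here (ArtifactAnnotation, Sample, STAGES) is a hypothetical assumption, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical annotation record: a bounding box around one synthesis
# artifact, paired with a human-aligned descriptive caption.
@dataclass
class ArtifactAnnotation:
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    caption: str                             # e.g. description of the visual flaw

# Hypothetical dataset sample: image path, binary label, artifact annotations.
@dataclass
class Sample:
    image_path: str
    is_ai_generated: bool
    artifacts: List[ArtifactAnnotation] = field(default_factory=list)

# Illustrative multi-stage schedule: each stage foregrounds one objective
# (detection, visual grounding, textual explanation) while keeping the others
# in the loss, so earlier capabilities are not forgotten. Weights are made up.
STAGES = [
    {"name": "detection",   "w_det": 1.0, "w_loc": 0.1, "w_txt": 0.1},
    {"name": "grounding",   "w_det": 0.5, "w_loc": 1.0, "w_txt": 0.3},
    {"name": "explanation", "w_det": 0.5, "w_loc": 0.5, "w_txt": 1.0},
]

def total_loss(l_det: float, l_loc: float, l_txt: float, stage: dict) -> float:
    """Weighted sum of the three per-sample losses for the current stage."""
    return stage["w_det"] * l_det + stage["w_loc"] * l_loc + stage["w_txt"] * l_txt

if __name__ == "__main__":
    sample = Sample(
        image_path="example.png",
        is_ai_generated=True,
        artifacts=[ArtifactAnnotation((120.0, 48.0, 210.0, 130.0),
                                      "extra finger merged into the palm")],
    )
    # With fixed per-objective losses, the combined loss shifts with the stage.
    for stage in STAGES:
        print(stage["name"], total_loss(0.7, 0.4, 0.9, stage))
```

The staged weighting is one simple way to realize "progressively balances the objectives"; the paper may instead use separate training phases, curriculum data mixes, or other scheduling, which the abstract does not state.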