Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs
June 8, 2025
Authors: Yikun Ji, Hong Yan, Jun Lan, Huijia Zhu, Weiqiang Wang, Qi Fan, Liqing Zhang, Jianfu Zhang
cs.AI
Abstract
The rapid advancement of image generation technologies intensifies the demand
for interpretable and robust detection methods. Although existing approaches
often attain high accuracy, they typically operate as black boxes without
providing human-understandable justifications. Multi-modal Large Language
Models (MLLMs), while not originally intended for forgery detection, exhibit
strong analytical and reasoning capabilities. When properly fine-tuned, they
can effectively identify AI-generated images and offer meaningful explanations.
However, existing MLLMs still struggle with hallucination and often fail to
align their visual interpretations with actual image content and human
reasoning. To bridge this gap, we construct a dataset of AI-generated images
annotated with bounding boxes and descriptive captions that highlight synthesis
artifacts, establishing a foundation for human-aligned visual-textual grounded
reasoning. We then fine-tune MLLMs through a multi-stage optimization strategy
that progressively balances the objectives of accurate detection, visual
localization, and coherent textual explanation. The resulting model achieves
superior performance in both detecting AI-generated images and localizing
visual flaws, significantly outperforming baseline methods.
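The abstract outlines, but does not specify, the annotation schema or the staging schedule. Purely as illustration, the Python sketch below shows one plausible shape for a dataset record (bounding boxes paired with artifact captions) and a staged loss-weighting schedule that progressively shifts emphasis across the three objectives. Every name and weight here (ArtifactAnnotation, Sample, STAGES) is a hypothetical assumption, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical annotation record: a bounding box around one synthesis
# artifact, paired with a human-aligned descriptive caption.
@dataclass
class ArtifactAnnotation:
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    caption: str                             # e.g. description of the visual flaw

# Hypothetical dataset sample: image path, binary label, artifact annotations.
@dataclass
class Sample:
    image_path: str
    is_ai_generated: bool
    artifacts: List[ArtifactAnnotation] = field(default_factory=list)

# Illustrative multi-stage schedule: each stage foregrounds one objective
# (detection, visual grounding, textual explanation) while keeping the others
# in the loss, so earlier capabilities are not forgotten. Weights are made up.
STAGES = [
    {"name": "detection",   "w_det": 1.0, "w_loc": 0.1, "w_txt": 0.1},
    {"name": "grounding",   "w_det": 0.5, "w_loc": 1.0, "w_txt": 0.3},
    {"name": "explanation", "w_det": 0.5, "w_loc": 0.5, "w_txt": 1.0},
]

def total_loss(l_det: float, l_loc: float, l_txt: float, stage: dict) -> float:
    """Weighted sum of the three per-sample losses for the current stage."""
    return stage["w_det"] * l_det + stage["w_loc"] * l_loc + stage["w_txt"] * l_txt

if __name__ == "__main__":
    sample = Sample(
        image_path="example.png",
        is_ai_generated=True,
        artifacts=[ArtifactAnnotation((120.0, 48.0, 210.0, 130.0),
                                      "extra finger merged into the palm")],
    )
    # With fixed per-objective losses, the combined loss shifts with the stage.
    for stage in STAGES:
        print(stage["name"], total_loss(0.7, 0.4, 0.9, stage))
```

The staged weighting is one simple way to realize "progressively balances the objectives"; the paper may instead use separate training phases, curriculum data mixes, or other scheduling, which the abstract does not state.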