Open Multimodal Retrieval-Augmented Factual Image Generation
October 26, 2025
Authors: Yang Tian, Fan Liu, Jingyuan Zhang, Wei Bi, Yupeng Hu, Liqiang Nie
cs.AI
Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress in
generating photorealistic and prompt-aligned images, but they often produce
outputs that contradict verifiable knowledge, especially when prompts involve
fine-grained attributes or time-sensitive events. Conventional
retrieval-augmented approaches attempt to address this issue by introducing
external information, yet they are fundamentally incapable of grounding
generation in accurate and evolving knowledge due to their reliance on static
sources and shallow evidence integration. To bridge this gap, we introduce
ORIG, an agentic open multimodal retrieval-augmented framework for Factual
Image Generation (FIG), a new task that requires both visual realism and
factual grounding. ORIG iteratively retrieves and filters multimodal evidence
from the web and incrementally integrates the refined knowledge into enriched
prompts to guide generation. To support systematic evaluation, we build
FIG-Eval, a benchmark spanning ten categories across perceptual, compositional,
and temporal dimensions. Experiments demonstrate that ORIG substantially
improves factual consistency and overall image quality over strong baselines,
highlighting the potential of open multimodal retrieval for factual image
generation.
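The iterative retrieve–filter–enrich loop described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names (`retrieve_evidence`, `filter_evidence`, `enrich_prompt`), the toy in-memory corpus, the relevance threshold, and the single-round stopping rule are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of ORIG-style iterative retrieval-augmented prompt
# enrichment. All names, data, and stopping criteria are illustrative
# assumptions, not the authors' actual system.

def retrieve_evidence(query):
    # Stand-in for open web multimodal retrieval; returns (snippet, score) pairs.
    corpus = {
        "2024 solar eclipse": [
            ("total eclipse crossed North America on April 8, 2024", 0.9),
            ("unrelated blog post", 0.2),
        ],
    }
    return corpus.get(query, [])

def filter_evidence(evidence, threshold=0.5):
    # Keep only snippets whose relevance score clears the threshold.
    return [text for text, score in evidence if score >= threshold]

def enrich_prompt(prompt, max_rounds=3):
    """Iteratively fold filtered evidence into an enriched generation prompt."""
    for _ in range(max_rounds):
        query = prompt.split(" | ")[0]  # trivial stand-in for agentic query formulation
        evidence = filter_evidence(retrieve_evidence(query))
        if not evidence:
            break  # no further reliable knowledge to add
        prompt = prompt + " | facts: " + "; ".join(evidence)
        break  # a real agent would reformulate the query and continue
    return prompt

print(enrich_prompt("2024 solar eclipse"))
```

The enriched prompt would then be passed to the image generator in place of the raw user prompt, so that generation is conditioned on the filtered evidence rather than on the model's parametric knowledge alone.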