IA-T2I: Internet-Augmented Text-to-Image Generation
May 21, 2025
Authors: Chuanhao Li, Jianwen Sun, Yukang Feng, Mingliang Zhai, Yifan Chang, Kaipeng Zhang
cs.AI
Abstract
Current text-to-image (T2I) generation models achieve promising results, but
they fail in scenarios where the knowledge implied by the text prompt is
uncertain. For example, a T2I model released in February would struggle to
generate a suitable poster for a movie premiering in April, because the
character designs and styles are unknown to the model. To solve this problem,
we propose an Internet-Augmented text-to-image generation (IA-T2I) framework
that makes such uncertain knowledge clear to T2I models by providing them with
reference images. Specifically, an active retrieval module determines whether
a reference image is needed for the given text prompt; a hierarchical image
selection module picks the most suitable image from those returned by an image
search engine to enhance the T2I model; and a self-reflection mechanism
continuously evaluates and refines the generated image to ensure faithful
alignment with the text prompt. To evaluate the proposed framework, we collect
a dataset named Img-Ref-T2I, in which text prompts involve three types of
uncertain knowledge: (1) known but rare, (2) unknown, and (3) ambiguous.
Moreover, we carefully craft a complex prompt that guides GPT-4o in performing
preference evaluation, which has been shown to have an evaluation accuracy
similar to that of human preference evaluation. Experimental results
demonstrate the effectiveness of our framework, which outperforms GPT-4o by
about 30% in human evaluation.
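
The control flow described in the abstract (decide whether to retrieve, select a
reference image, generate, then reflect and refine) can be sketched roughly as
follows. This is only an illustrative outline, not the authors' implementation:
the `Modules` container, the `ia_t2i` function, every callable signature, and
the `max_rounds` cap are hypothetical names introduced here, and the refinement
step is simplified to regenerating with the same reference.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Modules:
    """Hypothetical plug-in points; none of these names come from the paper."""
    needs_reference: Callable[[str], bool]                  # active retrieval decision
    search_images: Callable[[str], List[str]]               # image search engine query
    select_image: Callable[[str, List[str]], Optional[str]] # image selection step
    generate: Callable[[str, Optional[str]], str]           # T2I model call
    is_faithful: Callable[[str, str], bool]                 # self-reflection check


def ia_t2i(prompt: str, m: Modules, max_rounds: int = 3) -> str:
    """Return a generated image (as an opaque string handle) for `prompt`.

    `max_rounds` is an assumed upper bound on the number of generations.
    """
    reference = None
    # Retrieve only when the prompt appears to involve uncertain knowledge.
    if m.needs_reference(prompt):
        candidates = m.search_images(prompt)
        reference = m.select_image(prompt, candidates)

    image = m.generate(prompt, reference)

    # Self-reflection: re-check and regenerate until faithful or out of rounds.
    for _ in range(max_rounds - 1):
        if m.is_faithful(prompt, image):
            break
        image = m.generate(prompt, reference)
    return image
```

Each module is passed in as a plain callable so the loop can be exercised with
stubs before any real search engine or T2I model is attached.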