

World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge

October 5, 2025
作者: Moo Hyun Son, Jintaek Oh, Sun Bin Mun, Jaechul Roh, Sehyun Choi
cs.AI

Abstract

While text-to-image (T2I) models can synthesize high-quality images, their performance degrades significantly when prompted with novel or out-of-distribution (OOD) entities due to inherent knowledge cutoffs. We introduce World-To-Image, a novel framework that bridges this gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward accurate synthesis. Critically, our evaluation goes beyond traditional metrics, utilizing modern assessments like LLMGrader and ImageReward to measure true semantic fidelity. Our experiments show that World-To-Image substantially outperforms state-of-the-art methods in both semantic alignment and visual aesthetics, achieving a +8.1% improvement in accuracy-to-prompt on our curated NICE benchmark. Our framework achieves these results with high efficiency, in fewer than three iterations, paving the way for T2I systems that can better reflect the ever-changing real world. Our demo code is available here: https://github.com/mhson-kyle/World-To-Image.
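The retrieve-then-optimize loop described above can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation: every function name, the scoring scheme, and the threshold are assumptions introduced here for clarity.

```python
# Illustrative sketch of an agent loop that grounds a T2I prompt with
# retrieved world knowledge. All names and logic below are hypothetical.

def search_web_for_references(concept):
    """Stub: retrieve reference images for a concept unknown to the model."""
    return [f"ref_image_for_{concept}"]

def optimize_prompt(prompt, references):
    """Stub: multimodal prompt optimization using retrieved references."""
    return prompt + " | grounded by: " + ", ".join(references)

def generate_and_score(prompt):
    """Stub: T2I generation plus a semantic-fidelity score.

    Here the score simply rewards grounded prompts so the loop terminates;
    a real system would use an evaluator such as an LLM-based grader.
    """
    score = 0.9 if "grounded by" in prompt else 0.4
    return "image", score

def world_to_image_loop(prompt, unknown_concepts, max_iters=3, threshold=0.8):
    """Iterate retrieval + prompt optimization until the score clears a threshold."""
    image, score = generate_and_score(prompt)
    for _ in range(max_iters):
        if score >= threshold:
            break
        for concept in unknown_concepts:
            refs = search_web_for_references(concept)
            prompt = optimize_prompt(prompt, refs)
        image, score = generate_and_score(prompt)
    return image, prompt, score

image, final_prompt, score = world_to_image_loop(
    "a photo of the Fairy Pitta", unknown_concepts=["Fairy Pitta"])
print(score)  # → 0.9 (grounded prompt passes the threshold on iteration 2)
```

With the stub scorer, the first generation scores below the threshold, one retrieval pass grounds the prompt, and the second generation passes, consistent with the paper's claim of convergence in fewer than three iterations.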