World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge
October 5, 2025
Authors: Moo Hyun Son, Jintaek Oh, Sun Bin Mun, Jaechul Roh, Sehyun Choi
cs.AI
Abstract
While text-to-image (T2I) models can synthesize high-quality images, their
performance degrades significantly when prompted with novel or
out-of-distribution (OOD) entities due to inherent knowledge cutoffs. We
introduce World-To-Image, a novel framework that bridges this gap by empowering
T2I generation with agent-driven world knowledge. We design an agent that
dynamically searches the web to retrieve images for concepts unknown to the
base model. This information is then used to perform multimodal prompt
optimization, steering powerful generative backbones toward accurate
synthesis. Critically, our evaluation goes beyond traditional metrics,
utilizing modern assessments like LLMGrader and ImageReward to measure true
semantic fidelity. Our experiments show that World-To-Image substantially
outperforms state-of-the-art methods in both semantic alignment and visual
aesthetics, achieving a +8.1% improvement in accuracy-to-prompt on our curated
NICE benchmark. Our framework achieves these results efficiently, in fewer
than three iterations, paving the way for T2I systems that can better
reflect the ever-changing real world. Our demo code is available at
https://github.com/mhson-kyle/World-To-Image.
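
For a concrete picture of the retrieve-optimize-generate loop the abstract describes, here is a minimal Python sketch. It is an illustration under assumptions, not the repository's actual API: `search_images`, `optimize_prompt`, `generate`, `score`, and the `threshold` value are all hypothetical placeholders standing in for the web-retrieval agent, the multimodal prompt optimizer, the T2I backbone, and a fidelity evaluator such as ImageReward.

```python
# Hypothetical sketch of the agent loop described in the abstract.
# All callables are assumed interfaces, not the actual World-To-Image API.
from typing import Callable, List, Optional


def world_to_image(
    prompt: str,
    search_images: Callable[[str], List[bytes]],       # web retrieval for OOD concepts
    optimize_prompt: Callable[[str, List[bytes]], str],  # multimodal prompt optimization
    generate: Callable[[str], bytes],                  # T2I generative backbone
    score: Callable[[bytes, str], float],              # e.g. an ImageReward-style evaluator
    max_iters: int = 3,     # the abstract reports convergence in fewer than three iterations
    threshold: float = 0.9,  # assumed stopping criterion
) -> Optional[bytes]:
    """Iteratively ground T2I generation in retrieved world knowledge."""
    best_image, best_score = None, float("-inf")
    current_prompt = prompt
    for _ in range(max_iters):
        image = generate(current_prompt)
        s = score(image, prompt)  # fidelity is judged against the original prompt
        if s > best_score:
            best_image, best_score = image, s
        if s >= threshold:
            break
        # Retrieve reference images for concepts the backbone likely lacks,
        # then rewrite the prompt conditioned on that visual evidence.
        references = search_images(prompt)
        current_prompt = optimize_prompt(current_prompt, references)
    return best_image
```

The design choice the sketch highlights is that the evaluator always scores against the original prompt while only the working prompt is rewritten, so retrieval-driven optimization cannot drift away from the user's intent.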