World-To-Image: Grounding Text-to-Image Generation with Agent-Driven World Knowledge

Abstract

While text-to-image (T2I) models can synthesize high-quality images, their performance degrades significantly when they are prompted with novel or out-of-distribution (OOD) entities, because of inherent knowledge cutoffs. We introduce World-To-Image, a novel framework that bridges this gap by empowering T2I generation with agent-driven world knowledge. We design an agent that dynamically searches the web to retrieve images for concepts unknown to the base model. This information is then used to perform multimodal prompt optimization, steering powerful generative backbones toward accurate synthesis. Critically, our evaluation goes beyond traditional metrics, using modern assessments such as LLMGrader and ImageReward to measure true semantic fidelity. Our experiments show that World-To-Image substantially outperforms state-of-the-art methods in both semantic alignment and visual aesthetics, achieving a +8.1% improvement in accuracy-to-prompt on our curated NICE benchmark. Our framework achieves these results efficiently, in fewer than three iterations, paving the way for T2I systems that better reflect the ever-changing real world. Our demo code is available at https://github.com/mhson-kyle/World-To-Image.
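The retrieve-then-refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `PromptState`, `optimize_prompt`, `generate`, `score`, and `retrieve`, as well as the stopping threshold, are hypothetical placeholders, and the toy stubs stand in for a real T2I backbone, a real scorer such as ImageReward, and a real web image search.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class PromptState:
    """A multimodal prompt: text plus reference images (here just string IDs)."""
    text: str
    reference_images: List[str] = field(default_factory=list)


def optimize_prompt(
    prompt: str,
    generate: Callable[[PromptState], str],   # hypothetical T2I backbone
    score: Callable[[str, str], float],       # hypothetical fidelity scorer
    retrieve: Callable[[str], List[str]],     # hypothetical web image search
    threshold: float = 0.8,                   # assumed stopping criterion
    max_iters: int = 3,
) -> Tuple[str, float, int]:
    """Return (best_image, best_score, iterations_used)."""
    state = PromptState(text=prompt)
    best_image, best_score = "", float("-inf")
    iteration = 0
    for iteration in range(1, max_iters + 1):
        image = generate(state)
        current = score(image, prompt)  # always score against the original prompt
        if current > best_score:
            best_image, best_score = image, current
        if best_score >= threshold:
            break
        # Below threshold: ground the prompt with retrieved reference images.
        state.reference_images.extend(retrieve(prompt))
    return best_image, best_score, iteration


# Toy stubs so the sketch runs end to end without any model or network access.
def toy_generate(state: PromptState) -> str:
    return f"image({state.text}|refs={len(state.reference_images)})"


def toy_score(image: str, prompt: str) -> float:
    # Pretend the backbone only renders the entity well once it has a reference.
    return 0.4 if "refs=0" in image else 0.9


def toy_retrieve(concept: str) -> List[str]:
    return [f"ref-for-{concept}"]


image, final_score, iters = optimize_prompt(
    "a novel entity", toy_generate, toy_score, toy_retrieve
)
```

With these stubs, the first pass scores below the threshold, a reference image is retrieved, and the second pass stops the loop. In a real system the three callables would wrap the generation backbone, the evaluation model, and the search agent.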
