Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Abstract

Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.
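The four stages named in the abstract (prompt understanding, multimodal evidence searching, grounded recaptioning, final synthesis) can be pictured as a simple sequential loop. The following is only an illustrative sketch, not the authors' implementation: every name in it (Evidence, Trajectory, run_pipeline, and the demo_* stubs) is hypothetical, and the stubs return placeholder strings instead of calling any real model or search engine.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """One retrieved item; `kind` is 'text' or 'image' (hypothetical schema)."""
    kind: str
    content: str


@dataclass
class Trajectory:
    """Record of every stage, so the whole process could be inspected or supervised."""
    prompt: str
    queries: List[str] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)
    grounded_caption: str = ""
    image: str = ""


def run_pipeline(
    prompt: str,
    understand: Callable[[str], List[str]],
    search: Callable[[str], List[Evidence]],
    recaption: Callable[[str, List[Evidence]], str],
    synthesize: Callable[[str, List[Evidence]], str],
) -> Trajectory:
    """Run the four stages in order and keep each intermediate result."""
    traj = Trajectory(prompt=prompt)
    traj.queries = understand(prompt)                 # 1. prompt understanding
    for query in traj.queries:                        # 2. multimodal evidence searching
        traj.evidence.extend(search(query))
    traj.grounded_caption = recaption(prompt, traj.evidence)          # 3. grounded recaptioning
    traj.image = synthesize(traj.grounded_caption, traj.evidence)     # 4. final synthesis
    return traj


# Placeholder stages (assumptions for illustration only; no real model is called).
def demo_understand(prompt: str) -> List[str]:
    return [prompt]


def demo_search(query: str) -> List[Evidence]:
    return [
        Evidence("text", f"fact about {query}"),
        Evidence("image", f"ref-image://{query}"),
    ]


def demo_recaption(prompt: str, evidence: List[Evidence]) -> str:
    facts = "; ".join(e.content for e in evidence if e.kind == "text")
    return f"{prompt} ({facts})"


def demo_synthesize(caption: str, evidence: List[Evidence]) -> str:
    n_refs = sum(1 for e in evidence if e.kind == "image")
    return f"<image: {caption} | {n_refs} reference image(s)>"
```

Calling run_pipeline("a regional festival mask", demo_understand, demo_search, demo_recaption, demo_synthesize) returns a Trajectory whose fields hold the queries, the retrieved evidence, the grounded caption, and the placeholder image string. Keeping all intermediate outputs in one record mirrors the idea of supervising the full agentic process, although the actual trajectory format used in the paper is not specified here.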
