Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Abstract

Unified multimodal models provide a natural and promising architecture for understanding diverse, complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, so they struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant, long-tail factual concepts that explicitly require external knowledge grounding. Extensive experiments show that Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, while approaching the world-knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.
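The four stages named in the abstract (prompt understanding, multimodal evidence searching, grounded recaptioning, final synthesis) can be pictured as a simple sequential loop. The sketch below is only an illustration of that control flow under stated assumptions: the names `Evidence`, `Trajectory`, `run_pipeline`, and all `toy_*` functions are hypothetical and do not come from the paper, and the stubs return strings where a real system would call a model and a search backend.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """One retrieved item (hypothetical structure, not from the paper)."""
    text: str
    image_ref: str = ""  # placeholder for a retrieved reference image


@dataclass
class Trajectory:
    """Record of one pass through the four stages (hypothetical structure)."""
    prompt: str
    queries: List[str] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)
    grounded_caption: str = ""
    image: str = ""


def run_pipeline(
    prompt: str,
    understand: Callable[[str], List[str]],
    search: Callable[[str], List[Evidence]],
    recaption: Callable[[str, List[Evidence]], str],
    synthesize: Callable[[str, List[Evidence]], str],
) -> Trajectory:
    """Run the four stages in order and keep every intermediate result."""
    traj = Trajectory(prompt=prompt)
    # Stage 1: prompt understanding -> search queries for unknown concepts.
    traj.queries = understand(prompt)
    # Stage 2: multimodal evidence searching, one call per query.
    for query in traj.queries:
        traj.evidence.extend(search(query))
    # Stage 3: grounded recaptioning, conditioned on the retrieved evidence.
    traj.grounded_caption = recaption(prompt, traj.evidence)
    # Stage 4: final synthesis from the grounded caption and evidence.
    traj.image = synthesize(traj.grounded_caption, traj.evidence)
    return traj


# Toy stand-ins so the sketch runs without any model or network access.
def toy_understand(prompt: str) -> List[str]:
    return [prompt.strip()]


def toy_search(query: str) -> List[Evidence]:
    return [Evidence(text=f"fact about {query}", image_ref=f"ref://{query}")]


def toy_recaption(prompt: str, evidence: List[Evidence]) -> str:
    return prompt + " | " + "; ".join(e.text for e in evidence)


def toy_synthesize(caption: str, evidence: List[Evidence]) -> str:
    return f"<image for: {caption}>"


if __name__ == "__main__":
    result = run_pipeline(
        "a rare festival mask",
        toy_understand, toy_search, toy_recaption, toy_synthesize,
    )
    print(result.grounded_caption)
    print(result.image)
```

Keeping each stage's output in a single record mirrors the idea of supervising the full agentic process rather than only the final image, though the actual trajectory format used for training is not specified in the abstract.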