Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation

Abstract

While text-to-image (T2I) models have achieved remarkable progress, they still struggle with real-world requests that are underspecified, implicit, or dependent on up-to-date knowledge. We call this challenge the Context Gap: the mismatch between the context a user provides and the sufficient generation context a T2I model requires. To bridge this gap, we propose Qwen-Image-Agent, a unified, context-centric agentic framework that integrates planning, reasoning, search, memory, and feedback. Qwen-Image-Agent treats user input as partial context and progressively constructs the full generation context through Context-Aware Planning and Context Grounding. Specifically, Context-Aware Planning identifies missing context and plans how it should be acquired and used, while Context Grounding gathers that context through reasoning, search, memory, and feedback. To evaluate agentic image generation, we further introduce Image Agent Bench (IA-Bench), a benchmark covering four core image agent capabilities: Plan, Reason, Search, and Memory. Experiments on IA-Bench, Mindbench, and WISE-Verified show that Qwen-Image-Agent outperforms strong baselines and achieves state-of-the-art performance.
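The plan-then-ground loop described in the abstract can be illustrated with a minimal sketch. This is not the Qwen-Image-Agent implementation: every name below (`GenerationContext`, `plan_missing`, `ground`, `build_context`) is hypothetical, the "planner" is a simple key check rather than a model, and the grounding sources are plain dictionaries standing in for memory and search. The repeated rounds are only a rough stand-in for the feedback mechanism.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical sketch: none of these names come from the paper.

@dataclass
class GenerationContext:
    """User input (partial context) plus facts gathered during grounding."""
    user_input: str
    facts: Dict[str, str] = field(default_factory=dict)

    def to_prompt(self) -> str:
        # Append grounded facts to the original request.
        details = "; ".join(f"{k}: {v}" for k, v in self.facts.items())
        return f"{self.user_input} [{details}]" if details else self.user_input

# A grounding source maps a context key to a value, or None if it cannot help.
Source = Callable[[str], Optional[str]]

def plan_missing(ctx: GenerationContext, required: List[str]) -> List[str]:
    """Toy stand-in for planning: list required keys not yet in the context."""
    return [key for key in required if key not in ctx.facts]

def ground(ctx: GenerationContext, missing: List[str],
           sources: Dict[str, Source]) -> None:
    """Toy stand-in for grounding: try each source in order for each key."""
    for key in missing:
        for source in sources.values():
            value = source(key)
            if value is not None:
                ctx.facts[key] = value
                break

def build_context(user_input: str, required: List[str],
                  sources: Dict[str, Source],
                  max_rounds: int = 3) -> GenerationContext:
    """Alternate planning and grounding until nothing is missing or rounds run out."""
    ctx = GenerationContext(user_input)
    for _ in range(max_rounds):
        missing = plan_missing(ctx, required)
        if not missing:
            break
        ground(ctx, missing, sources)
    return ctx

# Illustrative data only: dictionaries play the role of memory and search.
memory = {"style": "watercolor"}
search_index = {"mascot": "a hypothetical mascot description"}
sources = {"memory": memory.get, "search": search_index.get}

ctx = build_context("Draw this year's event mascot in my usual style",
                    ["style", "mascot"], sources)
print(ctx.to_prompt())
```

In this sketch the user's request is kept intact and the gathered facts are appended to it, which mirrors the idea of treating user input as partial context; how the actual system represents and consumes the generation context is not stated in the abstract.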