Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation

Abstract

Text-to-image (T2I) models have made remarkable progress, yet they struggle with real-world requests that are often underspecified, implicit, or dependent on up-to-date knowledge. We identify this challenge as the Context Gap: the mismatch between the user context and the generation context that is sufficient for a T2I model. To bridge this gap, we propose Qwen-Image-Agent, a unified agentic framework that integrates planning, reasoning, search, memory, and feedback in a context-centric manner. Qwen-Image-Agent treats user input as partial context and progressively constructs the generation context through Context-Aware Planning and Context Grounding. Specifically, Context-Aware Planning identifies the missing context and plans how it should be acquired and used, while Context Grounding gathers that context from reasoning, search, memory, and feedback. To evaluate agentic image generation, we further introduce Image Agent Bench (IA-Bench), a benchmark covering four core image agent capabilities: Plan, Reason, Search, and Memory. Experiments on IA-Bench, Mindbench, and WISE-Verified show that Qwen-Image-Agent outperforms strong baselines and achieves state-of-the-art performance.
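The interplay between planning and grounding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`GenerationContext`, `plan_missing_context`, `ground_context`, `build_generation_context`, and the dictionary-backed "memory" and "search" sources) is a hypothetical stand-in, and reasoning and feedback are omitted for brevity. It only shows the general idea of treating the user request as partial context and filling in the rest round by round.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical type: a grounding source maps a context key to a value,
# or returns None when it cannot supply that piece of context.
Source = Callable[[str], Optional[str]]


@dataclass
class GenerationContext:
    """Hypothetical container: the user request plus context gathered so far."""
    user_request: str
    facts: Dict[str, str] = field(default_factory=dict)

    def to_prompt(self) -> str:
        # Append gathered context to the request in a stable (sorted) order.
        details = "; ".join(f"{k}: {v}" for k, v in sorted(self.facts.items()))
        return f"{self.user_request} [{details}]" if details else self.user_request


def plan_missing_context(ctx: GenerationContext, required: List[str]) -> List[str]:
    """Planning step (toy version): list required keys not yet in the context."""
    return [key for key in required if key not in ctx.facts]


def ground_context(ctx: GenerationContext, missing: List[str],
                   sources: Dict[str, Source]) -> List[str]:
    """Grounding step (toy version): try each source in order for every key.

    Returns the keys that no source could resolve.
    """
    unresolved = []
    for key in missing:
        for source in sources.values():
            value = source(key)
            if value is not None:
                ctx.facts[key] = value
                break
        else:
            unresolved.append(key)
    return unresolved


def build_generation_context(request: str, required: List[str],
                             sources: Dict[str, Source],
                             max_rounds: int = 3) -> GenerationContext:
    """Alternate planning and grounding until nothing is missing or no progress."""
    ctx = GenerationContext(request)
    for _ in range(max_rounds):
        missing = plan_missing_context(ctx, required)
        if not missing:
            break
        unresolved = ground_context(ctx, missing, sources)
        if unresolved == missing:  # no source made progress; stop early
            break
    return ctx


def demo() -> GenerationContext:
    # Assumed toy data standing in for a user memory store and a search tool.
    memory = {"style": "watercolor"}
    search = {"mascot": "a blue robot"}
    sources: Dict[str, Source] = {"memory": memory.get, "search": search.get}
    return build_generation_context(
        "Draw our team mascot in my usual style",
        ["style", "mascot", "season"],
        sources,
    )
```

In this toy run, `style` comes from the memory stand-in and `mascot` from the search stand-in, while `season` stays unresolved because no source knows it. A real system would presumably replace the fixed list of required keys with model-driven planning and would use feedback on generated images to revise the context.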