UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

Abstract

Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and they typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes the two tasks through a dual reasoning paradigm. We formulate generation as world-knowledge-enhanced planning that injects implicit constraints, and we leverage editing capabilities for fine-grained visual refinement, correcting visual errors through self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. To support the framework, we systematically construct a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense and physics) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench, and UniREditBench, while maintaining superior general synthesis capabilities.
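
The planning-then-refinement control flow described in the abstract can be illustrated with a small sketch. This is not the UniReason implementation: the function `plan_then_refine`, the `Draft` container, the `max_rounds` parameter, and all four toy components are hypothetical names introduced here, and images are stood in for by plain strings so the example runs without any model. It only shows the ordering the abstract describes: plan with added implicit constraints, generate, then repeatedly self-reflect and edit.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Draft:
    """Hypothetical container for one synthesis attempt (strings stand in for images)."""
    prompt: str
    plan: str
    image: str
    history: List[str] = field(default_factory=list)  # edit instructions applied so far


def plan_then_refine(
    prompt: str,
    planner: Callable[[str], str],
    generator: Callable[[str], str],
    critic: Callable[[str, str], str],
    editor: Callable[[str, str], str],
    max_rounds: int = 2,  # assumed cap on refinement rounds, not taken from the paper
) -> Draft:
    # Stage 1: planning. The planner expands the prompt with implicit constraints.
    plan = planner(prompt)
    draft = Draft(prompt=prompt, plan=plan, image=generator(plan))

    # Stage 2: refinement. Reflect on the output and edit until no issue is reported.
    for _ in range(max_rounds):
        instruction = critic(prompt, draft.image)
        if not instruction:  # empty string means the critic found nothing to fix
            break
        draft.image = editor(draft.image, instruction)
        draft.history.append(instruction)
    return draft


# Toy stand-ins so the sketch runs end to end; real components would be model calls.
def toy_planner(prompt: str) -> str:
    return prompt + " | implicit constraint: ice melts in sunlight"


def toy_generator(plan: str) -> str:
    return "image(" + plan + ") [defect]"


def toy_critic(prompt: str, image: str) -> str:
    return "remove defect" if "[defect]" in image else ""


def toy_editor(image: str, instruction: str) -> str:
    return image.replace(" [defect]", "")


result = plan_then_refine(
    "an ice cube left in the sun", toy_planner, toy_generator, toy_critic, toy_editor
)
print(result.image)
print(result.history)
```

Passing the four components as callables keeps the loop independent of any particular model; in a unified model, all four roles could be served by the same network under different prompts, which is one reading of the shared representation the abstract mentions.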
