UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation

Abstract

Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as those of current unified models. Our analysis reveals a crucial observation: understanding tasks benefit from modality alignment that increases progressively with network depth, which helps build up semantic information for better comprehension. In contrast, generation tasks follow a different trend: modality alignment increases in the early layers but decreases in the deep layers to recover spatial details. These divergent alignment patterns create a fundamental conflict in fully shared Transformer backbones, where a uniform representational flow often leads to performance compromises across the two tasks. Motivated by this finding, we introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning while employing task-specific branches in the deeper layers to avoid task interference. This design effectively balances shared learning and task specialization. Through extensive ablation experiments, we demonstrate that UniFork consistently outperforms conventional fully shared Transformer architectures and achieves performance on par with or better than task-specific models.
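The Y-shaped routing described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the class name `YShapedBackbone`, the helper `make_layer`, and the use of elementwise affine maps in place of Transformer blocks are all assumptions made purely for illustration. The only point it shows is the control flow, in which every input passes through the same shallow layers and then through exactly one task-specific deep branch.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Hypothetical stand-in for a Transformer block: an elementwise affine map."""
    def layer(x: Vector) -> Vector:
        return [scale * v + shift for v in x]
    return layer


@dataclass
class YShapedBackbone:
    """Illustrative Y-shaped stack: a shared trunk followed by two task branches."""
    shared: List[Layer]                # shallow layers used by both tasks
    understanding_branch: List[Layer]  # deep layers used only for understanding
    generation_branch: List[Layer]     # deep layers used only for generation

    def forward(self, x: Vector, task: str) -> Vector:
        if task == "understanding":
            branch = self.understanding_branch
        elif task == "generation":
            branch = self.generation_branch
        else:
            raise ValueError(f"unknown task: {task!r}")

        h = list(x)
        # Shared shallow layers: identical computation for both tasks.
        for layer in self.shared:
            h = layer(h)
        # Task-specific deep layers: only the selected branch is applied.
        for layer in branch:
            h = layer(h)
        return h


# Toy instance with arbitrary illustrative parameters.
model = YShapedBackbone(
    shared=[make_layer(2.0, 0.0)],
    understanding_branch=[make_layer(1.0, 1.0)],
    generation_branch=[make_layer(1.0, -1.0)],
)
understanding_out = model.forward([1.0, 2.0], task="understanding")
generation_out = model.forward([1.0, 2.0], task="generation")
```

In this toy setup the two tasks see the same intermediate representation after the shared layers and diverge only in the branch, which is the property the abstract credits with avoiding task interference. How many layers to share and how the branches are built are design choices that the abstract does not specify.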
