UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation

June 20, 2025
Authors: Teng Li, Quanfeng Lu, Lirui Zhao, Hao Li, Xizhou Zhu, Yu Qiao, Jun Zhang, Wenqi Shao
cs.AI

Abstract

Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our analysis reveals a crucial observation: understanding tasks benefit from progressively increasing modality alignment across network depth, which helps build up semantic information for better comprehension. In contrast, generation tasks follow a different trend: modality alignment increases in the early layers but decreases in the deep layers to recover spatial details. These divergent alignment patterns create a fundamental conflict in fully shared Transformer backbones, where a uniform representational flow often leads to performance compromises across the two tasks. Motivated by this finding, we introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in the deeper layers to avoid task interference. This design effectively balances shared learning and task specialization. Through extensive ablation experiments, we demonstrate that UniFork consistently outperforms conventional fully shared Transformer architectures and achieves performance on par with or better than task-specific models.
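
The abstract describes the Y-shaped design only at a high level. Below is a minimal PyTorch sketch, not the authors' implementation, of how a backbone can share shallow Transformer blocks across tasks and then fork into task-specific deep branches for understanding and generation; the class names (`YShapedBackbone`, `Block`), layer counts, and dimensions are illustrative assumptions.

```python
# Minimal sketch of a Y-shaped backbone: shared shallow layers, then
# separate deep branches per task. Hyperparameters are assumptions.
import torch
import torch.nn as nn


class Block(nn.Module):
    """A standard pre-norm Transformer block."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class YShapedBackbone(nn.Module):
    """Shared shallow layers, then a task-specific deep branch."""

    def __init__(self, dim: int = 768, shared_depth: int = 12, branch_depth: int = 12):
        super().__init__()
        self.shared = nn.ModuleList([Block(dim) for _ in range(shared_depth)])
        self.und_branch = nn.ModuleList([Block(dim) for _ in range(branch_depth)])
        self.gen_branch = nn.ModuleList([Block(dim) for _ in range(branch_depth)])

    def forward(self, tokens: torch.Tensor, task: str) -> torch.Tensor:
        # Shallow layers: cross-task representation learning on the
        # mixed image/text token sequence.
        for blk in self.shared:
            tokens = blk(tokens)
        # Deep layers: fork into a task-specific branch so the diverging
        # alignment patterns of understanding and generation do not interfere.
        branch = self.und_branch if task == "understanding" else self.gen_branch
        for blk in branch:
            tokens = blk(tokens)
        return tokens


if __name__ == "__main__":
    model = YShapedBackbone()
    x = torch.randn(2, 64, 768)  # (batch, tokens, dim)
    print(model(x, task="understanding").shape)  # torch.Size([2, 64, 768])
    print(model(x, task="generation").shape)     # torch.Size([2, 64, 768])
```

The fork point reflects the paper's observation: both tasks benefit from shared semantic alignment in the shallow layers, while the deep layers need freedom to either keep increasing alignment (understanding) or trade it for spatial detail (generation).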