LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer
June 8, 2025
Authors: Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang
cs.AI
Abstract
Recent advances in multimodal foundation models unifying image understanding
and generation have opened exciting avenues for tackling a wide range of
vision-language tasks within a single framework. Despite progress, existing
unified models typically require extensive pretraining and struggle to match
the performance of models dedicated to each task.
Additionally, many of these models suffer from slow image generation speeds,
limiting their practical deployment in real-time or resource-constrained
settings. In this work, we propose Layerwise Timestep-Expert Flow-based
Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image
understanding and generation within a single multimodal model. LaTtE-Flow
builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong
multimodal understanding capabilities, and extends them with a novel Layerwise
Timestep Experts flow-based architecture for efficient image generation.
LaTtE-Flow distributes the flow-matching process across specialized groups of
Transformer layers, each responsible for a distinct subset of timesteps. This
design significantly improves sampling efficiency by activating only a small
subset of layers at each sampling timestep. To further enhance performance, we
propose a Timestep-Conditioned Residual Attention mechanism for efficient
information reuse across layers. Experiments demonstrate that LaTtE-Flow
achieves strong performance on multimodal understanding tasks, while achieving
competitive image generation quality with around 6x faster inference speed
compared to recent unified multimodal models.
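The layerwise timestep-expert idea described above (splitting Transformer layers into groups, with each sampling timestep activating only its assigned group) can be illustrated with a minimal sketch. All function names, the grouping scheme, and the toy "layers" below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of layerwise timestep-expert routing: the layer stack
# is partitioned into contiguous groups, and a flow-matching timestep in
# [0, 1) activates only the group covering its sub-interval. This is why
# each sampling step touches only a fraction of the network's layers.

def make_layer_groups(layers, num_groups):
    """Split a list of layers into contiguous expert groups (assumed scheme)."""
    group_size = len(layers) // num_groups
    return [layers[i * group_size:(i + 1) * group_size]
            for i in range(num_groups)]

def select_group(timestep, num_groups):
    """Map a timestep t in [0, 1) to the index of its expert group."""
    return min(int(timestep * num_groups), num_groups - 1)

def expert_forward(x, layer_groups, timestep):
    """Run only the layer group responsible for this timestep."""
    group = layer_groups[select_group(timestep, len(layer_groups))]
    for layer in group:
        x = layer(x)
    return x

# Toy demo: 24 stand-in "layers" (each adds 1) split into 6 groups of 4.
layers = [lambda x, k=k: x + 1 for k in range(24)]
groups = make_layer_groups(layers, 6)
out = expert_forward(0, groups, timestep=0.5)  # runs only 4 of 24 layers
```

Under this sketch, a sampling step evaluates 1/num_groups of the stack, which is consistent with the abstract's claim that activating only a small subset of layers per timestep yields the reported inference speedup.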