LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer
June 8, 2025
Authors: Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang
cs.AI
Abstract
Recent advances in multimodal foundation models unifying image understanding
and generation have opened exciting avenues for tackling a wide range of
vision-language tasks within a single framework. Despite progress, existing
unified models typically require extensive pretraining and struggle to match
the performance of models dedicated to each task.
Additionally, many of these models suffer from slow image generation speeds,
limiting their practical deployment in real-time or resource-constrained
settings. In this work, we propose Layerwise Timestep-Expert Flow-based
Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image
understanding and generation within a single multimodal model. LaTtE-Flow
builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong
multimodal understanding capabilities, and extends them with a novel Layerwise
Timestep Experts flow-based architecture for efficient image generation.
LaTtE-Flow distributes the flow-matching process across specialized groups of
Transformer layers, each responsible for a distinct subset of timesteps. This
design significantly improves sampling efficiency by activating only a small
subset of layers at each sampling timestep. To further enhance performance, we
propose a Timestep-Conditioned Residual Attention mechanism for efficient
information reuse across layers. Experiments demonstrate that LaTtE-Flow
achieves strong performance on multimodal understanding tasks, while achieving
competitive image generation quality with around 6x faster inference speed
compared to recent unified multimodal models.
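The layerwise timestep-expert idea described above (splitting Transformer layers into groups, with each sampling timestep activating only its assigned group) can be illustrated with a minimal sketch. All function names, the grouping scheme, and the toy "layers" below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of layerwise timestep-expert routing: the layer stack
# is partitioned into contiguous groups, and a flow-matching timestep in
# [0, 1) activates only the group covering its sub-interval. This is why
# each sampling step touches only a fraction of the network's layers.

def make_layer_groups(layers, num_groups):
    """Split a list of layers into contiguous expert groups (assumed scheme)."""
    group_size = len(layers) // num_groups
    return [layers[i * group_size:(i + 1) * group_size]
            for i in range(num_groups)]

def select_group(timestep, num_groups):
    """Map a timestep t in [0, 1) to the index of its expert group."""
    return min(int(timestep * num_groups), num_groups - 1)

def expert_forward(x, layer_groups, timestep):
    """Run only the layer group responsible for this timestep."""
    group = layer_groups[select_group(timestep, len(layer_groups))]
    for layer in group:
        x = layer(x)
    return x

# Toy demo: 24 stand-in "layers" (each adds 1) split into 6 groups of 4.
layers = [lambda x, k=k: x + 1 for k in range(24)]
groups = make_layer_groups(layers, 6)
out = expert_forward(0, groups, timestep=0.5)  # runs only 4 of 24 layers
```

Under this sketch, a sampling step evaluates 1/num_groups of the stack, which is consistent with the abstract's claim that activating only a small subset of layers per timestep yields the reported inference speedup.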