MonoFormer: One Transformer for Both Diffusion and Autoregression
September 24, 2024
Authors: Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang
cs.AI
Abstract
Most existing multimodal methods either use separate backbones for
autoregression-based discrete text generation and diffusion-based continuous
visual generation, or use a single backbone by discretizing the visual data so
that autoregression handles both text and visual generation. In this paper, we
study a simple idea: share one transformer for both autoregression and
diffusion. The feasibility comes from two main aspects: (i) transformers have
been successfully applied to diffusion for visual generation, and (ii)
transformer training for autoregression and diffusion is very similar, the only
difference being that diffusion uses a bidirectional attention mask while
autoregression uses a causal attention mask. Experimental results show that our
approach achieves image generation performance comparable to current
state-of-the-art methods while maintaining text generation capability.
The project is publicly available at https://monoformer.github.io/.
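
The mask distinction in point (ii) can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`causal_mask`, `bidirectional_mask`, `masked_attention_weights`) and the toy uniform scores are assumptions made only for illustration. It shows the same attention-weight computation producing autoregressive or diffusion-style behavior depending solely on which mask is passed in.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def masked_attention_weights(scores, mask):
    """Row-wise softmax over scores; masked-out positions receive zero weight."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical toy input: uniform scores for a sequence of 4 tokens.
n = 4
scores = [[0.0] * n for _ in range(n)]

# Same computation, different mask.
ar_weights = masked_attention_weights(scores, causal_mask(n))           # autoregression
diff_weights = masked_attention_weights(scores, bidirectional_mask(n))  # diffusion
```

With uniform scores, each row of `ar_weights` spreads its weight evenly over the current and earlier positions only, while each row of `diff_weights` spreads it evenly over all four positions.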