A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation
October 2, 2024
作者: Liang Chen, Sinan Tan, Zefan Cai, Weichu Xie, Haozhe Zhao, Yichi Zhang, Junyang Lin, Jinze Bai, Tianyu Liu, Baobao Chang
cs.AI
Abstract
This work tackles the information loss bottleneck of vector-quantization (VQ)
autoregressive image generation by introducing a novel model architecture
called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer
predicts more codes for an image by introducing a new autoregression direction,
model depth, alongside the sequence-length direction. Compared to
traditional 1D autoregression and prior work using a similar 2D image
decomposition, such as the RQ-Transformer, the DnD-Transformer is an end-to-end
model that can generate higher quality images with the same backbone model size
and sequence length, opening a new optimization perspective for autoregressive
image generation. Furthermore, our experiments reveal that the
DnD-Transformer's potential extends beyond generating natural images. It can
even generate images with rich text and graphical elements in a self-supervised
manner, demonstrating an understanding of these combined modalities. This has
not been previously demonstrated for popular vision generative models such as
diffusion models, showing a spark of vision-language intelligence when trained
solely on images. Code, datasets, and models are available at
https://github.com/chenllliang/DnD-Transformer.
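To make the depth direction concrete, below is a minimal, illustrative PyTorch sketch of a 2D (sequence × depth) autoregressive decoder that predicts several VQ codes per sequence position. The names and hyperparameters (DnDDecoder, depth, vocab_size, etc.) are assumptions for illustration only, not the authors' released implementation, which differs in how and where the depth predictions are made.

```python
# Toy sketch of 2D (sequence x depth) autoregression, under assumed shapes and
# names; it is NOT the paper's implementation, only the structural idea: predict
# `depth` codes per position without lengthening the token sequence.
import torch
import torch.nn as nn

class DnDDecoder(nn.Module):
    def __init__(self, vocab_size=1024, d_model=256, depth=2, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # One prediction head per depth level: the added autoregression direction.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(depth)]
        )

    def forward(self, codes):
        # codes: (batch, seq_len, depth) integer code ids.
        # Sum the depth-wise code embeddings so each position occupies a single
        # token slot, keeping seq_len fixed (unlike flattening to seq_len*depth).
        h = self.embed(codes).sum(dim=2)                    # (B, L, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(h.size(1))
        h = self.backbone(h, mask=mask)                     # causal over sequence
        # Each head emits logits for one code in the depth stack.
        return torch.stack([head(h) for head in self.heads], dim=2)  # (B, L, D, V)

codes = torch.randint(0, 1024, (2, 16, 2))  # 16 positions, 2 codes per position
logits = DnDDecoder()(codes)
print(logits.shape)                          # torch.Size([2, 16, 2, 1024])
```

A full model would shift targets for next-position prediction and train each head with cross-entropy against its depth level's code; the point of the sketch is that depth-wise heads let the model predict more codes per image at the same backbone size and sequence length.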