A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Fine-Grained Image Generation
October 2, 2024
Authors: Liang Chen, Sinan Tan, Zefan Cai, Weichu Xie, Haozhe Zhao, Yichi Zhang, Junyang Lin, Jinze Bai, Tianyu Liu, Baobao Chang
cs.AI
Abstract
This work tackles the information loss bottleneck of vector-quantization (VQ)
autoregressive image generation by introducing a novel model architecture
called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer
predicts more codes for an image by introducing a new autoregression direction,
model depth, in addition to the sequence-length direction. Compared to
traditional 1D autoregression and prior work that uses a similar 2D image
decomposition, such as the RQ-Transformer, the DnD-Transformer is an end-to-end
model that can generate higher-quality images with the same backbone model size
and sequence length, opening a new optimization perspective for autoregressive
image generation. Furthermore, our experiments reveal that the
DnD-Transformer's potential extends beyond generating natural images. It can
even generate images with rich text and graphical elements in a self-supervised
manner, demonstrating an understanding of these combined modalities. This has
not been previously demonstrated for popular vision generative models such as
diffusion models, showing a spark of vision-language intelligence when trained
solely on images. Code, datasets, and models are available at
https://github.com/chenllliang/DnD-Transformer.
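To make the core idea concrete, below is a minimal PyTorch sketch of 2D autoregression (an illustration of the concept, not the authors' released implementation; the class name `DnDSketch`, the hyperparameters, and the choice to fuse depth codes by summing their embeddings are all assumptions). At each of the L sequence positions, the backbone's hidden state feeds D prediction heads, one per depth level, so a single forward pass scores L x D VQ codes per image instead of L.

```python
import torch
import torch.nn as nn

class DnDSketch(nn.Module):
    """Toy 2D-autoregressive decoder: depth x sequence-length code prediction."""
    def __init__(self, vocab_size=1024, d_model=256, depth=2,
                 n_layers=4, n_heads=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # One head per depth level: the extra autoregression direction.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(depth)])

    def forward(self, codes):
        # codes: (batch, seq_len, depth) integer VQ indices.
        B, L, D = codes.shape
        # Fuse the depth codes at each position by summing their embeddings
        # (an assumption; the paper may combine them differently).
        x = self.tok_emb(codes).sum(dim=2)
        x = x + self.pos_emb(torch.arange(L, device=codes.device))
        mask = nn.Transformer.generate_square_subsequent_mask(L).to(codes.device)
        h = self.backbone(x, mask=mask)
        # Each head emits logits for one depth code of the next position,
        # giving (batch, seq_len, depth, vocab_size) in one pass.
        return torch.stack([head(h) for head in self.heads], dim=2)

model = DnDSketch()
codes = torch.randint(0, 1024, (2, 64, 2))  # 2 images, 64 positions, depth 2
logits = model(codes)                       # -> (2, 64, 2, 1024)
```

Because all depth heads read the same backbone state, the extra codes come at the cost of a few linear layers rather than a longer sequence, which is what allows the same backbone size and sequence length to carry more information per image.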