A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Fine-Grained Image Generation
October 2, 2024
Authors: Liang Chen, Sinan Tan, Zefan Cai, Weichu Xie, Haozhe Zhao, Yichi Zhang, Junyang Lin, Jinze Bai, Tianyu Liu, Baobao Chang
cs.AI
Abstract
This work tackles the information loss bottleneck of vector-quantization (VQ)
autoregressive image generation by introducing a novel model architecture
called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer
predicts more codes for an image by introducing a new autoregression direction,
model depth, in addition to the sequence-length direction. Compared to
traditional 1D autoregression and prior work that uses a similar 2D image
decomposition, such as the RQ-Transformer, the DnD-Transformer is an end-to-end
model that can generate higher-quality images with the same backbone model size
and sequence length, opening a new optimization perspective for autoregressive
image generation. Furthermore, our experiments reveal that the
DnD-Transformer's potential extends beyond generating natural images. It can
even generate images with rich text and graphical elements in a self-supervised
manner, demonstrating an understanding of these combined modalities. This has
not been previously demonstrated for popular vision generative models such as
diffusion models, showing a spark of vision-language intelligence when trained
solely on images. Code, datasets, and models are available at
https://github.com/chenllliang/DnD-Transformer.
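To make the core idea concrete, below is a minimal PyTorch sketch of 2D autoregression (an illustration of the concept, not the authors' released implementation; the class name `DnDSketch`, the hyperparameters, and the choice to fuse depth codes by summing their embeddings are all assumptions). At each of the L sequence positions, the backbone's hidden state feeds D prediction heads, one per depth level, so a single forward pass scores L x D VQ codes per image instead of L.

```python
import torch
import torch.nn as nn

class DnDSketch(nn.Module):
    """Toy 2D-autoregressive decoder: depth x sequence-length code prediction."""
    def __init__(self, vocab_size=1024, d_model=256, depth=2,
                 n_layers=4, n_heads=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # One head per depth level: the extra autoregression direction.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(depth)])

    def forward(self, codes):
        # codes: (batch, seq_len, depth) integer VQ indices.
        B, L, D = codes.shape
        # Fuse the depth codes at each position by summing their embeddings
        # (an assumption; the paper may combine them differently).
        x = self.tok_emb(codes).sum(dim=2)
        x = x + self.pos_emb(torch.arange(L, device=codes.device))
        mask = nn.Transformer.generate_square_subsequent_mask(L).to(codes.device)
        h = self.backbone(x, mask=mask)
        # Each head emits logits for one depth code of the next position,
        # giving (batch, seq_len, depth, vocab_size) in one pass.
        return torch.stack([head(h) for head in self.heads], dim=2)

model = DnDSketch()
codes = torch.randint(0, 1024, (2, 64, 2))  # 2 images, 64 positions, depth 2
logits = model(codes)                       # -> (2, 64, 2, 1024)
```

Because all depth heads read the same backbone state, the extra codes come at the cost of a few linear layers rather than a longer sequence, which is what allows the same backbone size and sequence length to carry more information per image.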