A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Abstract

This work tackles the information-loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture, the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, alongside the sequence-length direction. Compared with traditional 1D autoregression and with previous work that uses a similar 2D image decomposition, such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher-quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer's potential extends beyond generating natural images: it can also generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities. This has not previously been demonstrated for popular vision generative models such as diffusion models, and it shows a spark of vision-language intelligence when the model is trained solely on images. Code, datasets, and models are available at https://github.com/chenllliang/DnD-Transformer.
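To make the idea of spending extra codes along a depth axis more concrete, the following is a minimal sketch of residual quantization, the general kind of 2D decomposition the abstract associates with RQ-Transformer. It is an illustration under stated assumptions, not the DnD-Transformer implementation: the function names, the random codebooks, and the zero entry added to each codebook (so that an extra depth level can never increase the error) are all hypothetical choices made for this example. The helper `sequence_lengths` only does the bookkeeping implied by the abstract: predicting several codes per position keeps the sequence length at the number of grid positions, whereas appending the extra codes to a 1D sequence would multiply it by the depth.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def residual_quantize(vec, codebooks):
    """Greedily pick one code per depth level.

    Each level quantizes the residual left over by the previous levels,
    so deeper code stacks can only describe the vector more precisely
    (given the zero entry assumed in make_codebooks).
    """
    codes, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def make_codebooks(depth, size, dim, seed=0):
    """Build hypothetical random codebooks, one per depth level.

    Assumption for illustration only: real codebooks are learned.
    Entry 0 is the zero vector, and the value range halves per level.
    """
    rng = random.Random(seed)
    books = []
    for level in range(depth):
        scale = 0.5 ** level
        book = [[0.0] * dim]
        book += [
            [rng.uniform(-scale, scale) for _ in range(dim)]
            for _ in range(size - 1)
        ]
        books.append(book)
    return books


def sequence_lengths(grid_h, grid_w, depth):
    """Compare sequence lengths for the same total number of codes."""
    positions = grid_h * grid_w
    return {"flattened_1d": positions * depth, "depth_wise": positions}


if __name__ == "__main__":
    books = make_codebooks(depth=3, size=16, dim=4)
    vec = [0.3, -0.2, 0.5, 0.1]
    for d in range(1, 4):
        codes, residual = residual_quantize(vec, books[:d])
        err = sum(r * r for r in residual)
        print(f"depth={d} codes={codes} squared_error={err:.4f}")
    print(sequence_lengths(16, 16, depth=2))
```

In this toy setting the squared reconstruction error is non-increasing as depth grows, which is the sense in which more codes per position reduce information loss. How the DnD-Transformer actually predicts the depth codes inside the backbone is described in the paper and repository, not here.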
