Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
April 24, 2025
作者: Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu
cs.AI
Abstract
Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than diffusion-based models. A primary limitation is the substantial number of image tokens AR models must process, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in the Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from the visual encoder are directly mapped onto high-dimensional language vocabularies. Leveraging this, we introduce two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to decrease the number of input tokens, and token-unshuffle, which untangles the inferred tokens after the Transformer blocks to restore the spatial arrangement for output. Jointly trained with textual prompts, our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token-prediction framework while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text alignment, visual flaws, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.