Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
April 24, 2025
作者: Xu Ma, Peize Sun, Haoyu Ma, Hao Tang, Chih-Yao Ma, Jialiang Wang, Kunpeng Li, Xiaoliang Dai, Yujun Shi, Xuan Ju, Yushi Hu, Artsiom Sanakoyeu, Felix Juefei-Xu, Ji Hou, Junjiao Tian, Tao Xu, Tingbo Hou, Yen-Cheng Liu, Zecheng He, Zijian He, Matt Feiszli, Peizhao Zhang, Peter Vajda, Sam Tsai, Yun Fu
cs.AI
Abstract
Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than diffusion-based models. A primary limitation is the substantial number of image tokens AR models must process, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in the Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from the visual encoder are directly mapped onto high-dimensional language vocabularies. Leveraging this, we introduce two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to decrease the number of input tokens, and token-unshuffle, which untangles the inferred tokens after the Transformer blocks to restore the spatial arrangement for output. Jointly trained with textual prompts, our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token-prediction framework while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text alignment, visual flaws, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.