Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Abstract

Autoregressive (AR) models have long dominated language generation and are increasingly applied to image synthesis, but they are often considered less competitive than diffusion-based models. A primary limitation is the large number of image tokens that AR models require, which constrains training and inference efficiency as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in the Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from the visual encoder are mapped directly to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along the channel dimension to reduce the number of input tokens, and token-unshuffle, which untangles the inferred tokens after the Transformer blocks to restore the spatial arrangement for output. Trained jointly with textual prompts, our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis through unified next-token prediction while keeping training and inference efficient. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with satisfying generation performance. On GenAI-benchmark, our 2.7B model achieves an overall score of 0.77 on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our strong image generation ability in terms of text alignment, visual flaws, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
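
The abstract describes token-shuffle and token-unshuffle only at a high level, so the following is a minimal NumPy sketch of the reshaping idea rather than the authors' implementation. It assumes tokens are stored in row-major (raster) order on a height x width grid and that each s x s window is merged by concatenating its tokens along the channel dimension. The function names, the window size `s`, and the argument layout are hypothetical, and any learned layers the actual method may place around these operations are omitted.

```python
import numpy as np


def token_shuffle(tokens, height, width, s=2):
    """Merge each s x s window of neighboring tokens into one wider token.

    tokens: array of shape (height * width, dim), row-major order (assumed).
    Returns an array of shape ((height // s) * (width // s), s * s * dim).
    """
    n, dim = tokens.shape
    assert n == height * width, "token count must match the grid size"
    assert height % s == 0 and width % s == 0, "grid must be divisible by s"
    # Split rows and columns into (window index, offset inside the window).
    grid = tokens.reshape(height // s, s, width // s, s, dim)
    # Bring the two window indices to the front: (H/s, W/s, s, s, dim).
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Flatten each window into the channel dimension.
    return grid.reshape((height // s) * (width // s), s * s * dim)


def token_unshuffle(merged, height, width, s=2):
    """Inverse of token_shuffle: restore the original spatial arrangement."""
    n, channels = merged.shape
    assert n == (height // s) * (width // s), "unexpected merged token count"
    assert channels % (s * s) == 0, "channels must be divisible by s * s"
    dim = channels // (s * s)
    grid = merged.reshape(height // s, width // s, s, s, dim)
    # Undo the transpose: back to (H/s, s, W/s, s, dim).
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(height * width, dim)


if __name__ == "__main__":
    h, w, d, s = 4, 4, 3, 2
    x = np.arange(h * w * d).reshape(h * w, d)
    merged = token_shuffle(x, h, w, s)
    # The sequence is s * s times shorter and each token is s * s times wider.
    print(x.shape, "->", merged.shape)
    restored = token_unshuffle(merged, h, w, s)
    print("round trip exact:", np.array_equal(restored, x))
```

Under these assumptions, a window size of `s` shortens the sequence seen by the Transformer by a factor of s * s, and the unshuffle step recovers the per-position tokens exactly because both operations are pure reshapes.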
