
An Image is Worth 32 Tokens for Reconstruction and Generation

June 11, 2024
Authors: Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
cs.AI

Abstract

Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations face challenges in managing the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce the Transformer-based 1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 x 256 x 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves performance competitive with state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains a gFID of 1.97, outperforming the MaskGIT baseline significantly by 4.21 on the ImageNet 256 x 256 benchmark. The advantages of TiTok become even more pronounced at higher resolutions. On the ImageNet 512 x 512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the number of image tokens by 64x, leading to a 410x faster generation process. Our best-performing variant can significantly surpass DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
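
To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of how a 1D tokenizer of this kind could be structured, based only on the abstract above: patch embeddings and a small fixed set of learnable latent tokens are processed jointly by a Transformer encoder, and only the latent-token outputs are vector-quantized, so the token count (e.g. 32) is decoupled from the 2D patch grid. All names (`TiTokSketch`), layer sizes, and the nearest-neighbor quantizer are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TiTokSketch(nn.Module):
    """Hypothetical sketch of a Transformer-based 1D image tokenizer.

    A ViT-style encoder attends jointly over image patches and a fixed
    set of learnable latent tokens; only the latent slots are kept and
    quantized, so the number of discrete tokens is independent of the
    2D patch grid.
    """

    def __init__(self, image_size=256, patch_size=16, dim=512,
                 num_latent_tokens=32, codebook_size=4096, depth=8):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 256 patches here
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.latent_tokens = nn.Parameter(torch.randn(num_latent_tokens, dim))
        self.pos_embed = nn.Parameter(
            torch.randn(num_patches + num_latent_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.codebook = nn.Embedding(codebook_size, dim)  # VQ codebook
        self.num_latent_tokens = num_latent_tokens

    def forward(self, images):
        b = images.shape[0]
        # (B, 3, 256, 256) -> (B, 256, D) patch embeddings
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)
        # Broadcast the learnable latent tokens across the batch: (B, 32, D)
        latents = self.latent_tokens.expand(b, -1, -1)
        x = torch.cat([patches, latents], dim=1) + self.pos_embed
        x = self.encoder(x)
        # Keep only the latent slots; the patch outputs are discarded.
        z = x[:, -self.num_latent_tokens:]
        # Nearest-neighbor vector quantization to discrete token ids.
        dists = torch.cdist(z, self.codebook.weight[None].expand(b, -1, -1))
        return dists.argmin(dim=-1)  # (B, 32) discrete token ids

tokenizer = TiTokSketch()
ids = tokenizer(torch.randn(1, 3, 256, 256))
print(ids.shape)  # torch.Size([1, 32]) vs. (256/16)^2 = 256 for a 2D grid
```

Under this reading, the key design choice is that `num_latent_tokens` is a free hyperparameter rather than a function of image resolution, which is what would allow the 64x token reduction (and the reported 410x generation speedup) at 512 x 512.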
