An Image is Worth 32 Tokens for Reconstruction and Generation

June 11, 2024
Authors: Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
cs.AI

Abstract

Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations face challenges in managing the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce the Transformer-based 1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 x 256 x 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves performance competitive with state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains a gFID of 1.97, significantly outperforming the MaskGIT baseline's 4.21 on the ImageNet 256 x 256 benchmark. The advantages of TiTok become even more pronounced at higher resolutions. On the ImageNet 512 x 512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04) but also reduces the number of image tokens by 64x, leading to a 410x faster generation process. Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while generating high-quality samples 74x faster.
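The core mechanism described above — compressing an image's 2D patch grid into a short 1D sequence of learnable latent tokens and quantizing those against a codebook — can be illustrated with a minimal PyTorch sketch. All module sizes, names, and the plain nearest-neighbor quantizer below are illustrative assumptions, not the authors' implementation; the actual TiTok uses a full ViT encoder-decoder trained with a vector-quantization objective.

```python
# Minimal sketch of a TiTok-style 1D tokenizer (hyperparameters and the
# nearest-neighbor VQ are illustrative assumptions, not the paper's exact setup).
import torch
import torch.nn as nn


class TiTok1DTokenizer(nn.Module):
    def __init__(self, image_size=256, patch_size=16, dim=512,
                 num_latent_tokens=32, codebook_size=4096, depth=8):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 256 patches for 256x256
        self.num_latent_tokens = num_latent_tokens
        # Patchify: non-overlapping conv projects each patch to a dim-d vector.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        # Learnable 1D latent tokens that absorb the image content via attention.
        self.latent_tokens = nn.Parameter(torch.randn(num_latent_tokens, dim))
        self.pos_embed = nn.Parameter(
            torch.randn(num_patches + num_latent_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Codebook used to quantize latent tokens into discrete ids.
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, images):
        b = images.shape[0]
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B,256,D)
        latents = self.latent_tokens.expand(b, -1, -1)                 # (B,32,D)
        x = torch.cat([patches, latents], dim=1) + self.pos_embed
        x = self.encoder(x)
        # Discard patch outputs; keep only the 1D latent sequence (B,32,D).
        z = x[:, -self.num_latent_tokens:]
        # Nearest-neighbor quantization against the codebook.
        flat = z.reshape(-1, z.shape[-1])
        dists = torch.cdist(flat, self.codebook.weight)
        return dists.argmin(dim=-1).view(b, -1)  # discrete token ids, (B,32)


# A batch of 256x256x3 images becomes just 32 discrete tokens per image,
# versus the 256 or 1024 tokens of a 2D grid tokenizer.
tokens = TiTok1DTokenizer()(torch.randn(2, 3, 256, 256))
print(tokens.shape)  # torch.Size([2, 32])
```

Because the latent tokens attend to every patch, the sequence length is decoupled from image resolution, which is what lets the token count stay at 32 even as resolution grows.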
