
Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

July 2, 2025
作者: Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han
cs.AI

Abstract

We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction, but have achieved only limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation orderings and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the number of generation steps from 256 to 20 (256×256 resolution) and from 1024 to 48 (512×512 resolution) without compromising quality on ImageNet class-conditional generation, while achieving at least 3.4× lower latency than previous parallelized autoregressive models.
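The locality-aware ordering can be illustrated with a toy greedy scheduler. This is only a sketch of the stated principle, not the paper's actual algorithm: positions decoded in the same step are pushed far apart on the patch grid (weak intra-group dependency), while each new group is drawn toward already-generated positions (strong contextual support). The group-size schedule below is a hypothetical doubling scheme chosen for illustration.

```python
import math

def dist(p, q):
    """Euclidean distance between two (row, col) grid positions."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def locality_aware_schedule(grid=16, group_sizes=None):
    """Partition a grid x grid patch lattice into parallel-decoding groups.

    Sketch only: each group is built greedily so that its members are
    mutually distant (low intra-group dependency) and close to tokens
    generated in earlier steps (high contextual support).
    """
    if group_sizes is None:
        # Hypothetical schedule: group size doubles each step (capped),
        # until all grid*grid positions are covered.
        group_sizes, n, s = [], grid * grid, 1
        while n > 0:
            g = min(s, n)
            group_sizes.append(g)
            n -= g
            s = min(s * 2, 32)

    remaining = [(r, c) for r in range(grid) for c in range(grid)]
    generated, schedule = [], []
    for size in group_sizes:
        group = []
        for _ in range(size):
            def score(p):
                # Far from the rest of this group -> safe to decode in parallel.
                d_group = min((dist(p, q) for q in group), default=grid)
                # Near already-generated tokens -> more context available.
                d_ctx = min((dist(p, q) for q in generated), default=0)
                return d_group - d_ctx
            best = max(remaining, key=score)
            remaining.remove(best)
            group.append(best)
        generated.extend(group)
        schedule.append(group)
    return schedule
```

On an 8×8 grid this toy schedule covers all 64 positions in 7 steps instead of 64 sequential ones, which is the kind of step reduction (256 → 20, 1024 → 48) the abstract reports for the real method.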