Autoregressive Image Generation with Randomized Parallel Decoding
March 13, 2025
Authors: Haopeng Li, Jinyue Yang, Guoqi Li, Huan Wang
cs.AI
Abstract
We introduce ARPG, a novel visual autoregressive model that enables
randomized parallel generation, addressing the inherent limitations of
conventional raster-order approaches, which hinder inference efficiency and
zero-shot generalization due to their sequential, predefined token generation
order. Our key insight is that effective random-order modeling necessitates
explicit guidance for determining the position of the next predicted token. To
this end, we propose a novel guided decoding framework that decouples
positional guidance from content representation, encoding them separately as
queries and key-value pairs. By directly incorporating this guidance into the
causal attention mechanism, our approach enables fully random-order training
and generation, eliminating the need for bidirectional attention. Consequently,
ARPG readily generalizes to zero-shot tasks such as image inpainting,
outpainting, and resolution expansion. Furthermore, it supports parallel
inference by concurrently processing multiple queries using a shared KV cache.
On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.94 with only
64 sampling steps, achieving over a 20-fold increase in throughput while
reducing memory consumption by over 75% compared to representative recent
autoregressive models at a similar scale.
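To make the decoupled guidance concrete, below is a minimal single-head PyTorch sketch of the mechanism the abstract describes: queries are built only from target-position embeddings (where to predict next), keys and values only from already-generated content, and several queries read one shared KV cache so multiple positions are decoded in parallel. All names, tensor shapes, and the single-step, single-head setup are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of ARPG-style guided decoding (names and shapes
# are assumptions for illustration only).
import torch
import torch.nn.functional as F

d = 64           # head dimension (assumption)
seq_len = 16     # number of tokens generated so far
n_queries = 4    # positions decoded in parallel in this step

# Positional guidance: queries come only from the *target* positions'
# embeddings, telling the model WHERE the next tokens go.
pos_emb = torch.randn(n_queries, d)               # target-position embeddings
W_q = torch.randn(d, d)
q = pos_emb @ W_q                                 # (n_queries, d)

# Content representation: keys/values come from already-generated tokens
# and are computed once, then shared across all parallel queries.
content = torch.randn(seq_len, d)                 # generated-token features
W_k, W_v = torch.randn(d, d), torch.randn(d, d)
k_cache, v_cache = content @ W_k, content @ W_v   # shared KV cache

# Each positional query attends to the cached content. Queries never
# attend to one another, so any number of target positions can be
# decoded concurrently against the same cache.
out = F.scaled_dot_product_attention(
    q.unsqueeze(0), k_cache.unsqueeze(0), v_cache.unsqueeze(0)
)                                                 # (1, n_queries, d)
per_position_features = out.squeeze(0)            # fed to the output head
```

Because the queries carry no content of their own and only read the cache, the key-value pairs built from generated tokens can be reused unchanged as further positions are decoded in an arbitrary order, which is what makes random-order, parallel inference possible without bidirectional attention.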