Neighboring Autoregressive Modeling for Efficient Visual Generation
March 12, 2025
Authors: Yefei He, Yuanyu He, Shaoxuan He, Feng Chen, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
cs.AI
Abstract
Visual autoregressive models typically adhere to a raster-order "next-token
prediction" paradigm, which overlooks the spatial and temporal locality
inherent in visual content. Specifically, visual tokens exhibit significantly
stronger correlations with their spatially or temporally adjacent tokens
compared to those that are distant. In this paper, we propose Neighboring
Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive
visual generation as a progressive outpainting procedure, following a
near-to-far "next-neighbor prediction" mechanism. Starting from an initial
token, the remaining tokens are decoded in ascending order of their Manhattan
distance from the initial token in the spatial-temporal space, progressively
expanding the boundary of the decoded region. To enable parallel prediction of
multiple adjacent tokens in the spatial-temporal space, we introduce a set of
dimension-oriented decoding heads, each predicting the next token along a
mutually orthogonal dimension. During inference, all tokens adjacent to the
decoded tokens are processed in parallel, substantially reducing the model
forward steps for generation. Experiments on ImageNet 256×256 and UCF101
demonstrate that NAR achieves 2.4× and 8.6× higher throughput
respectively, while obtaining superior FID/FVD scores for both image and video
generation tasks compared to the PAR-4X approach. When evaluated on the
text-to-image generation benchmark GenEval, NAR with 0.8B parameters
outperforms Chameleon-7B while using merely 0.4× the training data. Code is
available at https://github.com/ThisisBillhe/NAR.
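The near-to-far decoding order described above can be made concrete with a small sketch. The code below (an illustrative assumption, not the authors' released implementation) groups the positions of an H×W token grid by Manhattan distance from the initial token at (0, 0); each group corresponds to one parallel forward step, so a grid needs H+W-1 steps instead of the H×W steps of raster-order next-token prediction.

```python
def nar_decode_schedule(h, w):
    """Group (row, col) token positions of an h x w grid by their
    Manhattan distance from the initial token at (0, 0)."""
    groups = {}
    for r in range(h):
        for c in range(w):
            # Manhattan distance from (0, 0) is simply r + c.
            groups.setdefault(r + c, []).append((r, c))
    # One parallel forward step per distance value: h + w - 1 steps
    # in total, versus h * w steps for raster-order decoding.
    return [groups[d] for d in sorted(groups)]

schedule = nar_decode_schedule(4, 4)
print(len(schedule))  # 7 forward steps for a 4x4 grid (vs 16 raster steps)
print(schedule[1])    # tokens at Manhattan distance 1: [(0, 1), (1, 0)]
```

In the full method, the dimension-oriented decoding heads are what allow every position in one distance group to be predicted simultaneously; this sketch only shows the scheduling side of that idea.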