Neighboring Autoregressive Modeling for Efficient Visual Generation
March 12, 2025
作者: Yefei He, Yuanyu He, Shaoxuan He, Feng Chen, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
cs.AI
Abstract
Visual autoregressive models typically adhere to a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens than with distant ones. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the number of model forward steps required for generation. Experiments on ImageNet 256×256 and UCF101 demonstrate that NAR achieves 2.4× and 8.6× higher throughput, respectively, while obtaining superior FID/FVD scores for image and video generation compared to the PAR-4X approach. When evaluated on the text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4× the training data. Code is available at https://github.com/ThisisBillhe/NAR.
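
To make the near-to-far decoding schedule concrete, below is a minimal Python sketch, not the authors' released implementation: it assumes a 2D image-token grid with the initial token at position (0, 0), and the function name is illustrative. It groups token positions by their Manhattan distance from the initial token; every group lies adjacent to already-decoded tokens, so (with dimension-oriented heads, one per axis) each group could in principle be predicted in a single parallel forward step.

```python
from collections import defaultdict

def nar_decoding_schedule(height, width):
    """Group token positions of a height x width grid by Manhattan distance
    from the initial token at (0, 0). Positions in the same group border the
    already-decoded region and can be predicted in one parallel forward step."""
    steps = defaultdict(list)
    for i in range(height):
        for j in range(width):
            steps[i + j].append((i, j))
    # Step 0 holds only the initial token; later steps expand the boundary.
    return [steps[d] for d in sorted(steps)]

if __name__ == "__main__":
    for step, positions in enumerate(nar_decoding_schedule(4, 4)):
        print(f"step {step}: {positions}")
    # A 4x4 grid takes 4 + 4 - 1 = 7 forward steps here, versus 16 steps
    # under raster-order next-token prediction.
```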