Neighboring Autoregressive Modeling for Efficient Visual Generation

Abstract

Visual autoregressive models typically adhere to a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens than with distant ones. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure following a near-to-far "next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the number of model forward steps required for generation. Experiments on ImageNet 256×256 and UCF101 demonstrate that NAR achieves 2.4× and 8.6× higher throughput, respectively, while obtaining superior FID/FVD scores for both image and video generation tasks compared to the PAR-4X approach. When evaluated on the text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 0.4 of the training data. Code is available at https://github.com/ThisisBillhe/NAR.
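
The near-to-far decoding order described above can be illustrated with a small sketch. It is not taken from the NAR codebase: the function name `nar_schedule`, the choice of the first grid position as the initial token, and the example grid sizes are all assumptions made for illustration. The sketch only groups token coordinates by Manhattan distance from the initial token, so that each group corresponds to one parallel decoding step; it does not model the network or the decoding heads.

```python
from itertools import product


def nar_schedule(shape, start=None):
    """Group token coordinates by Manhattan distance from the start token.

    `shape` is the token grid size, e.g. (H, W) for an image or (T, H, W)
    for a video. `start` defaults to the all-zero corner, which is an
    assumption of this sketch. Returns a list of groups in ascending
    distance order; tokens in one group would be decoded in the same step.
    """
    if start is None:
        start = (0,) * len(shape)
    groups = {}
    for coord in product(*(range(n) for n in shape)):
        # Manhattan distance: sum of absolute per-axis offsets.
        dist = sum(abs(c - s) for c, s in zip(coord, start))
        groups.setdefault(dist, []).append(coord)
    return [groups[d] for d in sorted(groups)]


if __name__ == "__main__":
    image_steps = nar_schedule((4, 4))     # hypothetical 4x4 token grid
    video_steps = nar_schedule((2, 3, 3))  # hypothetical 2x3x3 token grid
    print(len(image_steps), [len(g) for g in image_steps])
    print(len(video_steps), sum(len(g) for g in video_steps))
```

Under these assumptions, a 4×4 grid is covered in 7 groups instead of 16 raster-order steps, and in general a grid with side lengths n_1, ..., n_k starting from a corner needs n_1 + ... + n_k - k + 1 groups. The forward-step count of an actual model may differ from this count depending on how the initial token is produced.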
