ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality

Abstract

In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating autoregressive (AR) visual generation. The motivation stems from the observation that images exhibit local structure, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme in the row dimension, the tokens corresponding to spatially adjacent regions in the column dimension can be decoded in parallel, enabling a "next-set prediction" paradigm. By decoding multiple tokens simultaneously in a single forward pass, ZipAR significantly reduces the number of forward passes required to generate an image, which substantially improves generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining.
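
To make the "next-set prediction" idea concrete, the following is a minimal sketch of how a decoding schedule over a token grid might be laid out. It is not the authors' implementation: the function name `zip_schedule`, the `window` parameter, and the rule that a row may start once the row above it has `window` decoded tokens are all assumptions introduced here for illustration. The sketch only counts forward passes; it does not run a model.

```python
from collections import defaultdict


def zip_schedule(height: int, width: int, window: int) -> list[list[tuple[int, int]]]:
    """Group (row, col) token positions by the forward pass that decodes them.

    Hypothetical rule (an assumption, not taken from the abstract):
    row r may start once row r-1 has `window` tokens decoded, and every
    active row then advances by one token per forward pass.
    """
    if not 1 <= window <= width:
        raise ValueError("window must be between 1 and width")

    passes = defaultdict(list)
    for row in range(height):
        for col in range(width):
            # Row `row` starts at pass row * window, then moves left to right.
            passes[row * window + col].append((row, col))

    # Pass indices are contiguous because window <= width.
    return [passes[step] for step in range(len(passes))]


if __name__ == "__main__":
    schedule = zip_schedule(height=4, width=4, window=2)
    for step, tokens in enumerate(schedule):
        print(step, tokens)
    print(len(schedule))  # 10 forward passes instead of 4 * 4 = 16
```

Under this assumed rule, the number of forward passes is `(height - 1) * window + width`, compared with `height * width` for plain next-token decoding. Setting `window` equal to `width` recovers the sequential raster order, while smaller values decode more rows in parallel.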
