Autoregressive Image Generation with Randomized Parallel Decoding

Abstract

We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation. It addresses the inherent limitations of conventional raster-order approaches, whose sequential, predefined token generation order hinders inference efficiency and zero-shot generalization. Our key insight is that effective random-order modeling requires explicit guidance for determining the position of the next predicted token. To this end, we propose a novel guided decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By incorporating this guidance directly into the causal attention mechanism, our approach enables fully random-order training and generation and eliminates the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot tasks such as image inpainting, outpainting, and resolution expansion. It also supports parallel inference by processing multiple queries concurrently against a shared KV cache. On the ImageNet-1K 256×256 benchmark, our approach attains an FID of 1.94 with only 64 sampling steps, achieving more than a 20-fold increase in throughput while reducing memory consumption by over 75% compared to representative recent autoregressive models of similar scale.
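The guided decoding idea described in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the sizes, the random embeddings, the start token, the output projection `w_out`, the greedy `argmax` selection, and the omission of learned query/key/value projections are all assumptions made for illustration. What it does show is the split the abstract describes: queries carry only the target positions, keys and values carry the content of tokens generated so far, and several queries read the same shared cache in one step.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def guided_attention(pos_queries, k_cache, v_cache):
    """Single-head attention: queries hold target-position guidance only,
    keys/values hold the content of already generated tokens."""
    d = pos_queries.shape[-1]
    scores = pos_queries @ k_cache.T / np.sqrt(d)  # (n_queries, n_cached)
    return softmax(scores) @ v_cache                # (n_queries, d)

# Hypothetical toy sizes and random stand-ins for learned parameters.
NUM_TOKENS, DIM, VOCAB = 16, 8, 32
rng = np.random.default_rng(0)
pos_emb = rng.normal(size=(NUM_TOKENS, DIM))  # one embedding per image position
tok_emb = rng.normal(size=(VOCAB, DIM))       # one embedding per codebook entry
w_out = rng.normal(size=(DIM, VOCAB))         # projection to token logits
start = rng.normal(size=(1, DIM))             # stand-in for a class/start token

def generate(steps):
    """Decode all positions in a random order, several positions per step."""
    order = rng.permutation(NUM_TOKENS)        # random generation order
    tokens = np.full(NUM_TOKENS, -1)
    k_cache, v_cache = start.copy(), start.copy()
    for group in np.array_split(order, steps):
        # All queries in the group attend to the same shared cache in parallel.
        hidden = guided_attention(pos_emb[group], k_cache, v_cache)
        new_tokens = (hidden @ w_out).argmax(axis=-1)
        tokens[group] = new_tokens
        # Cache the content of the new tokens (token + position embedding).
        # The cache only ever holds earlier tokens, so attention stays causal.
        content = tok_emb[new_tokens] + pos_emb[group]
        k_cache = np.concatenate([k_cache, content])
        v_cache = np.concatenate([v_cache, content])
    return tokens, order

tokens, order = generate(steps=4)  # 4 steps, 4 positions decoded per step
```

In this sketch, setting `steps` equal to `NUM_TOKENS` decodes one position at a time, while smaller values decode several positions per step from the same cache, which is the source of the parallel speedup the abstract refers to.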

