Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

Abstract

Pixel-space continuous-token autoregressive (AR) generation models an image directly as a sequence of raw pixel patches, so it needs neither discrete tokenization nor a separately pretrained tokenizer. However, it faces two coupled challenges: generating high-dimensional patches causes large single-step errors, and teacher-forced training creates a train-inference gap that lets these errors accumulate across AR steps. Existing fixes such as x-prediction and input noise injection only partially mitigate these issues. Exact rollout training matches inference-time conditions more closely, but it is impractical because sequential sampling is prohibitively slow. We propose Parallel Rollout Approximation (PRA), a scalable framework that addresses both challenges jointly. PRA generates low-dimensional intermediate states instead of high-dimensional pixel patches, then maps them back to pixel-space tokens with a pixel decoder, preserving a pixel-in, pixel-out AR interface. During training, it also constructs inference-like pixel inputs independently at each position through the same intermediate-state-to-pixel path used at inference. This approximates the pixel-feedback interface encountered during inference-time rollout while retaining parallel teacher-forced training. On class-conditional ImageNet-1K generation at 256×256 resolution, PRA-S with 135M parameters achieves an FID of 2.58, surpassing the previous billion-parameter-scale pixel-space AR result of 3.60. Scaling to PRA-L with 511M parameters further improves FID to 1.94, establishing a new state of the art among pixel-space AR models. Beyond generation, PRA achieves higher ImageNet classification probing accuracy than other AR and diffusion baselines, suggesting its potential for unified pixel-space image generation and understanding.
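The data flow described in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the patch size of 16, the state dimension of 16, the zero start token, and the helper names `patchify` and `inference_like_inputs`. In particular, a fixed random linear map with additive Gaussian noise stands in for the learned generator of intermediate states, and its pseudo-inverse stands in for the pixel decoder. The sketch shows only the two structural ideas: an image becomes a sequence of raw pixel patches, and the AR inputs for all positions are built in one parallel pass through a state-to-pixel path rather than by sequential rollout.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into a row-major sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def inference_like_inputs(patches, enc, dec, noise_std, rng):
    """Build AR inputs for every position at once (hypothetical helper).

    Each patch is mapped to a low-dimensional state, perturbed, and mapped
    back to pixel space independently of all other positions. The linear
    maps and the Gaussian noise are placeholders for learned components.
    """
    states = patches @ enc  # (N, d_low)
    states = states + noise_std * rng.standard_normal(states.shape)
    approx = states @ dec  # (N, d_pix), the state-to-pixel path
    # Shift right so position i is conditioned on the approximation of
    # patch i-1; position 0 receives an all-zero start token (assumption).
    inputs = np.zeros_like(approx)
    inputs[1:] = approx[:-1]
    return inputs


rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))      # stand-in for a 256x256 RGB image
patches = patchify(image, patch=16)    # 256 tokens, each 16*16*3 = 768 dims

d_pix, d_low = patches.shape[1], 16    # assumed dimensions
enc = rng.standard_normal((d_pix, d_low)) / np.sqrt(d_pix)
dec = np.linalg.pinv(enc)              # toy "pixel decoder"

inputs = inference_like_inputs(patches, enc, dec, noise_std=0.1, rng=rng)
targets = patches                      # teacher-forced targets stay clean
```

Because no position depends on another position's generated output, the whole input sequence is produced by two matrix products, which is what keeps the training step parallel. An exact rollout would instead have to sample position i before it could build the input for position i+1.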