Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Abstract

Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a k × k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground-truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20× faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2× faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.
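To make the difference between a "cell" and a "subsample" concrete, the following is a minimal sketch of how the patch coordinates of a square grid might be partitioned into each kind of entity. It is an illustration under stated assumptions, not the paper's implementation: the function names `cell_groups` and `subsample_groups` are hypothetical, and the strided rule used for subsamples is only one plausible reading of "a non-local grouping of distant patches".

```python
# Illustrative sketch only. The function names and the strided subsample rule
# are assumptions made for this example; they are not taken from the paper.


def cell_groups(grid: int, k: int) -> list[list[tuple[int, int]]]:
    """Partition a grid x grid patch layout into k x k cells of neighboring patches."""
    if grid % k != 0:
        raise ValueError("grid must be divisible by k")
    groups = []
    for r0 in range(0, grid, k):          # top row of each cell
        for c0 in range(0, grid, k):      # left column of each cell
            groups.append(
                [(r, c) for r in range(r0, r0 + k) for c in range(c0, c0 + k)]
            )
    return groups


def subsample_groups(grid: int, s: int) -> list[list[tuple[int, int]]]:
    """Partition the same layout into s * s groups of patches spaced s apart.

    Assumed rule: each group takes every s-th patch along both axes, so its
    members are spread across the whole image rather than adjacent.
    """
    if grid % s != 0:
        raise ValueError("grid must be divisible by s")
    groups = []
    for dr in range(s):                   # row offset of this group
        for dc in range(s):               # column offset of this group
            groups.append(
                [(r, c) for r in range(dr, grid, s) for c in range(dc, grid, s)]
            )
    return groups


if __name__ == "__main__":
    cells = cell_groups(4, 2)
    subs = subsample_groups(4, 2)
    print(cells[0])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(subs[0])   # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

Both functions cover every patch exactly once, so either grouping could serve as the sequence of prediction units in an AR loop. The only difference is whether the patches inside one unit are adjacent (cell) or spread out (subsample).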
