Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
February 27, 2025
Authors: Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
cs.AI
Abstract
Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a k×k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground-truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20× faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2× faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.
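
To make the two ideas in the abstract more concrete, the following is a minimal sketch (not the authors' code) of one training step that combines next-entity regression with flow matching and Noisy Context Learning. The tensor shapes, the linear interpolation path, and the `model(x_t, t)` interface are illustrative assumptions; any causal masking over entities is assumed to happen inside the model.

```python
import torch


def xar_style_training_step(model, entities):
    """Hypothetical training step illustrating the abstract's recipe.

    (1) Predict the next *entity* (e.g. a cell of k x k patches) rather than
        a single discrete token.
    (2) Condition on *noisy* entities (Noisy Context Learning) and regress a
        flow-matching velocity target instead of classifying tokens.

    entities: (B, N, D) tensor -- N entities per image, each flattened to D dims.
    model:    assumed to map (noisy entities, timestep) -> predicted velocity.
    """
    B, N, D = entities.shape

    # Sample a flow-matching time t and Gaussian noise for every entity.
    t = torch.rand(B, N, 1, device=entities.device)
    noise = torch.randn_like(entities)

    # Linear interpolation path x_t = (1 - t) * noise + t * data,
    # whose target velocity is (data - noise).
    x_t = (1.0 - t) * noise + t * entities
    target_velocity = entities - noise

    # The model conditions on noisy entities (no teacher forcing on clean
    # ground truth) and regresses a velocity for each entity.
    pred_velocity = model(x_t, t)

    # Continuous entity regression replaces discrete token classification.
    loss = torch.mean((pred_velocity - target_velocity) ** 2)
    return loss
```

In this sketch the entity granularity (patch, cell, subsample, scale, or whole image) only changes how `entities` is constructed before the step; the regression objective itself is unchanged.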