Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

Abstract

Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs, including those after CFG, to lie on a fixed-radius hypersphere (constant ℓ₂ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that the hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M parameters) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M parameters) reaches FID 1.54 and SphereAR-B (208M parameters) reaches FID 1.92, matching or surpassing much larger baselines such as MAR-H (943M parameters, FID 1.55) and VAR-d30 (2B parameters, FID 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster order surpasses diffusion and masked-generation models at comparable parameter scales.
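The fixed-radius constraint described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify how or where the projection is applied, so the helper names (`project_to_sphere`, `cfg_combine`), the radius value, and the toy vectors are all hypothetical. The CFG step uses the commonly used extrapolation form, uncond + scale * (cond - uncond), which is an assumption about the guidance rule. The point shown is that a CFG combination of two on-sphere vectors generally leaves the sphere, and rescaling restores the constant ℓ₂ norm while keeping the direction.

```python
import math


def l2_norm(v):
    """Euclidean (l2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def project_to_sphere(v, radius):
    """Rescale v so that its l2 norm equals radius (hypothetical helper)."""
    norm = l2_norm(v)
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto a sphere")
    return [radius * x / norm for x in v]


def cfg_combine(cond, uncond, scale):
    """Common CFG extrapolation (assumed form): uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy values, chosen only for illustration.
radius = 2.0
cond = project_to_sphere([1.0, 2.0, 2.0], radius)     # conditional prediction, on the sphere
uncond = project_to_sphere([2.0, -1.0, 0.5], radius)  # unconditional prediction, on the sphere

guided = cfg_combine(cond, uncond, scale=3.0)          # generally off the sphere
guided_on_sphere = project_to_sphere(guided, radius)   # back to constant l2 norm

print(l2_norm(guided), l2_norm(guided_on_sphere))
```

In this toy case the guided vector's norm differs from the radius before projection and matches it afterwards, so only the direction of the CFG result is carried forward to the next AR step.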
