
Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

September 29, 2025
作者: Guolin Ke, Hui Xue
cs.AI

Abstract

Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs -- including after CFG -- to lie on a fixed-radius hypersphere (constant ℓ₂ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that the hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster-order decoding surpasses diffusion and masked-generation models at comparable parameter scales.
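The abstract's central operation is simple to state: every latent that the AR model consumes or emits, including the latent obtained after combining conditional and unconditional predictions under CFG, is rescaled to a fixed ℓ₂ norm so that only its direction carries information. Below is a minimal sketch of that projection step in PyTorch, written from the abstract alone; the function names `project_to_sphere` and `cfg_step`, the guidance formula, and the default radius are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def project_to_sphere(z: torch.Tensor, radius: float = 1.0, eps: float = 1e-8) -> torch.Tensor:
    # Rescale each latent token (last dim) to a constant L2 norm, i.e. onto a
    # fixed-radius hypersphere. This discards the scale component of the latent,
    # which the abstract identifies as the main source of variance collapse.
    return radius * z / (z.norm(dim=-1, keepdim=True) + eps)

def cfg_step(z_cond: torch.Tensor, z_uncond: torch.Tensor, guidance_scale: float,
             radius: float = 1.0) -> torch.Tensor:
    # Standard classifier-free guidance combination (assumed form), followed by
    # re-projection so the guided latent also lies on the hypersphere, as the
    # abstract requires for all AR inputs and outputs "including after CFG".
    z_guided = z_uncond + guidance_scale * (z_cond - z_uncond)
    return project_to_sphere(z_guided, radius)

# Usage sketch: project two hypothetical per-token latent predictions.
if __name__ == "__main__":
    z_c, z_u = torch.randn(4, 256), torch.randn(4, 256)
    z_next = cfg_step(z_c, z_u, guidance_scale=3.0)
    print(z_next.norm(dim=-1))  # each token's norm equals the chosen radius (1.0)
```

The point of the sketch is only to show why the scale component disappears: after normalization every token has the same norm by construction, so AR decoding and CFG can shift the latent's direction but cannot shrink or inflate its magnitude.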