Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

Abstract

Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs, including those produced after CFG, to lie on a fixed-radius hypersphere (constant ℓ₂ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that the hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches 1.92, matching or surpassing much larger baselines such as MAR-H (943M, 1.55) and VAR-d30 (2B, 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster order has surpassed diffusion and masked-generation models at comparable parameter scales.
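The constraint described in the abstract can be illustrated with a minimal NumPy sketch: combine conditional and unconditional predictions with the standard CFG extrapolation, then rescale the result back to a constant ℓ₂ norm. The function names, the radius value, the token dimension, and the guidance scale below are illustrative assumptions, not details taken from the paper, and the actual SphereAR implementation may apply the projection differently.

```python
import numpy as np

def project_to_sphere(x, radius, eps=1e-8):
    """Rescale each row of x so that its L2 norm equals `radius`."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return radius * x / np.maximum(norm, eps)

def cfg_on_sphere(cond, uncond, scale, radius):
    """Standard CFG extrapolation, followed by re-projection onto the sphere."""
    guided = uncond + scale * (cond - uncond)  # generally leaves the sphere
    return project_to_sphere(guided, radius)

rng = np.random.default_rng(0)
dim = 16                 # hypothetical token dimension
radius = np.sqrt(dim)    # assumed radius; the paper's choice may differ

# Hypothetical conditional / unconditional token predictions on the sphere.
cond = project_to_sphere(rng.normal(size=(4, dim)), radius)
uncond = project_to_sphere(rng.normal(size=(4, dim)), radius)

# Without re-projection, guidance changes the norm of each token.
raw_norms = np.linalg.norm(uncond + 3.0 * (cond - uncond), axis=-1)

# With re-projection, every token keeps the same fixed norm.
tokens = cfg_on_sphere(cond, uncond, scale=3.0, radius=radius)
norms = np.linalg.norm(tokens, axis=-1)
```

In this sketch only the direction of each token carries information. The scale that guidance would otherwise inflate is discarded by the projection, which is the intuition behind removing the scale component.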
