Autoregressive Image Generation with Masked Bit Modeling

Abstract

This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight: as the codebook grows, they suffer from performance degradation or prohibitive training costs. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts each discrete token by progressively generating its constituent bits. BAR achieves a new state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches. The project page is available at https://bar-gen.github.io/
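To make the bit-level view concrete, the following is a minimal sketch of two ideas the abstract relies on: the bit budget of a discrete latent, and a token expressed as a sequence of bits that is filled in over several steps. It is not the authors' implementation. The abstract does not specify the bit encoding, the masking schedule, or the prediction head, so the plain binary expansion, the fixed left-to-right chunking, and all names here (`latent_bits`, `token_to_bits`, `bits_to_token`, `progressive_decode`, `predict_bit`) are assumptions chosen for illustration. The figures of 256 tokens and a 2^16-entry codebook are likewise hypothetical.

```python
import math


def latent_bits(num_tokens, codebook_size):
    """Bits carried by a discrete latent: tokens x log2(codebook size)."""
    return num_tokens * math.log2(codebook_size)


def token_to_bits(token, num_bits):
    """Binary expansion of a token index, most significant bit first."""
    if not 0 <= token < 2 ** num_bits:
        raise ValueError("token does not fit in num_bits")
    return [(token >> i) & 1 for i in reversed(range(num_bits))]


def bits_to_token(bits):
    """Inverse of token_to_bits."""
    token = 0
    for b in bits:
        token = (token << 1) | b
    return token


def progressive_decode(num_bits, bits_per_step, predict_bit):
    """Toy progressive unmasking: start fully masked (None) and fill
    `bits_per_step` positions per step. `predict_bit(state, i)` is a
    hypothetical stand-in for a learned bit-prediction head."""
    bits = [None] * num_bits
    steps = 0
    while None in bits:
        snapshot = list(bits)  # every prediction in a step sees the same state
        masked = [i for i, b in enumerate(bits) if b is None]
        for i in masked[:bits_per_step]:
            bits[i] = predict_bit(snapshot, i)
        steps += 1
    return bits, steps


if __name__ == "__main__":
    # Hypothetical setting: 256 tokens, codebook of 2**16 entries.
    print(latent_bits(256, 2 ** 16))  # 4096.0 bits in total

    # An oracle replaces the model so the round trip can be checked.
    target = token_to_bits(40000, 16)
    bits, steps = progressive_decode(16, 4, lambda state, i: target[i])
    print(bits_to_token(bits), steps)  # 40000 4
```

One point the sketch does illustrate under these assumptions: a head that predicts bits needs an output width that grows with log2 of the codebook size, whereas a softmax over token indices grows with the codebook size itself. That is consistent with the abstract's claim that BAR supports arbitrary codebook sizes.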
