Autoregressive Image Generation with Masked Bit Modeling
February 9, 2026
Authors: Qihang Yu, Qihao Liu, Ju He, Xinyang Zhang, Yang Liu, Liang-Chieh Chen, Xi Chen
cs.AI
Abstract
This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight, suffering from performance degradation or prohibitive training costs as the codebook scales. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts discrete tokens by progressively generating their constituent bits. BAR achieves a new state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches. The project page is available at https://bar-gen.github.io/
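To make the bit-level view concrete: a codebook of size 2^16 assigns 16 bits to each token, and those bits can be unmasked a few at a time rather than predicting one index over the full codebook. The snippet below is a minimal, illustrative sketch of that idea under assumed names and a PyTorch-style setup; MaskedBitHead, token_to_bits, the feature dimension, and the confidence-based unmasking schedule are hypothetical choices for intuition, not the BAR implementation.

```python
# Illustrative sketch only: decompose a discrete token index from a large codebook
# into its binary digits and progressively unmask them with a small "bit head"
# driven by an autoregressive transformer feature. Not the authors' code.
import torch
import torch.nn as nn

CODEBOOK_SIZE = 2 ** 16   # a "scaled" codebook: 16 bits allocated per token
NUM_BITS = 16

def token_to_bits(idx: torch.Tensor) -> torch.Tensor:
    """Integer indices in [0, 2^NUM_BITS) -> binary digits, shape (..., NUM_BITS)."""
    shifts = torch.arange(NUM_BITS, device=idx.device)
    return ((idx.unsqueeze(-1) >> shifts) & 1).float()

def bits_to_token(bits: torch.Tensor) -> torch.Tensor:
    """Binary digits -> integer indices (inverse of token_to_bits)."""
    shifts = torch.arange(NUM_BITS, device=bits.device)
    return (bits.long() << shifts).sum(-1)

class MaskedBitHead(nn.Module):
    """Toy head: predicts one logit per bit from the transformer feature plus the
    bits revealed so far (masked positions are encoded as 0.5)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + NUM_BITS, dim),
            nn.GELU(),
            nn.Linear(dim, NUM_BITS),
        )

    def forward(self, feat: torch.Tensor, bits: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([feat, bits], dim=-1))

@torch.no_grad()
def sample_token(head: MaskedBitHead, feat: torch.Tensor, steps: int = 4) -> torch.Tensor:
    """Unmask NUM_BITS bits over a few rounds, most confident bits first, then decode."""
    bits = torch.full((feat.shape[0], NUM_BITS), 0.5, device=feat.device)
    known = torch.zeros_like(bits, dtype=torch.bool)
    per_step = NUM_BITS // steps
    for _ in range(steps):
        probs = torch.sigmoid(head(feat, bits))
        conf = (probs - 0.5).abs().masked_fill(known, -1.0)  # skip revealed bits
        pick = conf.topk(per_step, dim=-1).indices
        sampled = torch.bernoulli(probs.gather(-1, pick))
        bits.scatter_(-1, pick, sampled)
        known.scatter_(-1, pick, True)
    return bits_to_token(bits)

# Toy usage: one transformer feature per image token position.
feat = torch.randn(8, 256)             # 8 token positions, feature dim 256
head = MaskedBitHead(dim=256)
print(sample_token(head, feat).shape)  # torch.Size([8])
```

The point of the sketch is only the interface: the per-token prediction cost grows with the number of bits (log2 of the codebook size) rather than with the codebook size itself, which is what lets the codebook be scaled freely.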