

Autoregressive Image Generation with Masked Bit Modeling

February 9, 2026
Authors: Qihang Yu, Qihao Liu, Ju He, Xinyang Zhang, Yang Liu, Liang-Chieh Chen, Xi Chen
cs.AI

Abstract

This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight, suffering from performance degradation or prohibitive training costs as the codebook scales. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts discrete tokens by progressively generating their constituent bits. BAR achieves a new state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches. The project page is available at https://bar-gen.github.io/.
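
As a rough illustration of the bit-wise prediction idea described in the abstract, the sketch below shows one way a masked bit modeling head could progressively unmask the bits of a single token conditioned on an autoregressive transformer's output. This is an assumption for illustration only, not the authors' released architecture: the names `BitHead` and `sample_token_bits`, the confidence-based unmasking schedule, and the `steps` parameter are all hypothetical.

```python
import torch

class BitHead(torch.nn.Module):
    """Hypothetical masked-bit head: predicts the bits of one discrete token,
    conditioned on the AR transformer's output for that position plus the
    bit state revealed so far (-1 = masked, 0/1 = revealed)."""
    def __init__(self, cond_dim: int, num_bits: int, hidden: int = 256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(cond_dim + num_bits, hidden),
            torch.nn.GELU(),
            torch.nn.Linear(hidden, num_bits),  # one logit per bit
        )

    def forward(self, cond, bit_state):
        return self.net(torch.cat([cond, bit_state], dim=-1))


@torch.no_grad()
def sample_token_bits(head, cond, num_bits, steps=4):
    """Progressively unmask bits over a few refinement steps, then convert
    the completed bit vector back into an integer token index."""
    batch = cond.shape[0]
    bits = torch.full((batch, num_bits), -1.0, device=cond.device)  # all masked
    revealed = torch.zeros(batch, num_bits, dtype=torch.bool, device=cond.device)
    per_step = -(-num_bits // steps)  # ceil(num_bits / steps) bits per step
    for step in range(steps):
        probs = torch.sigmoid(head(cond, bits))
        # rank still-masked bits by confidence (distance from 0.5)
        conf = torch.where(revealed, torch.full_like(probs, -1.0), (probs - 0.5).abs())
        k = min(per_step, num_bits - step * per_step)
        idx = conf.topk(k, dim=-1).indices
        sampled = torch.bernoulli(probs.gather(-1, idx))
        bits.scatter_(-1, idx, sampled)
        revealed.scatter_(-1, idx, True)
    # token index = sum_b bit_b * 2^b
    weights = (2 ** torch.arange(num_bits, device=cond.device)).float()
    return (bits * weights).sum(-1).long()
```

As a quick sanity check, `sample_token_bits(BitHead(768, 20), torch.randn(2, 768), num_bits=20)` returns two token indices in [0, 2^20). Because the number of predicted bits grows only as log2 of the codebook size, a codebook with 2^20 entries requires a 20-way binary prediction rather than a million-way softmax, which is one plausible reading of how the framework can support arbitrary codebook sizes.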