Autoregressive Image Generation with Masked Bit Modeling

Abstract

This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight: with scaled codebooks they suffer from either performance degradation or prohibitive training costs. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts each discrete token by progressively generating its constituent bits. BAR achieves a new state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches. The project page is available at https://bar-gen.github.io/.
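The two quantities the abstract relies on, the latent bit budget and the bit-level view of a discrete token, can be illustrated with a small sketch. This is not the authors' implementation. It assumes a codebook of size 2^K so that each token index maps to exactly K bits, and the names `latent_bits`, `token_to_bits`, `bits_to_token`, and `progressive_unmask`, the left-to-right reveal order, and the `predict_bit` callable (a stand-in for a learned masked bit modeling head) are all hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Optional


def latent_bits(num_tokens: int, codebook_size: int) -> float:
    """Total bits carried by num_tokens discrete tokens from one codebook."""
    return num_tokens * math.log2(codebook_size)


def token_to_bits(token: int, num_bits: int) -> List[int]:
    """Big-endian binary expansion of a token index."""
    if not 0 <= token < 2 ** num_bits:
        raise ValueError("token does not fit in num_bits")
    return [(token >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_token(bits: List[int]) -> int:
    """Inverse of token_to_bits."""
    token = 0
    for bit in bits:
        token = (token << 1) | bit
    return token


def progressive_unmask(
    num_bits: int,
    bits_per_step: int,
    predict_bit: Callable[[List[Optional[int]], int], int],
) -> int:
    """Fill an all-masked bit vector a few positions per step.

    Assumption: positions are revealed left to right in fixed-size groups.
    Within one step, every position is predicted from the same snapshot of
    the partially revealed bits, mimicking a parallel masked prediction.
    """
    bits: List[Optional[int]] = [None] * num_bits  # None marks a masked bit
    for start in range(0, num_bits, bits_per_step):
        snapshot = list(bits)
        for pos in range(start, min(start + bits_per_step, num_bits)):
            bits[pos] = predict_bit(snapshot, pos)
    return bits_to_token(bits)


# Illustrative numbers only: 256 tokens with a 2**16-entry codebook.
budget = latent_bits(256, 2 ** 16)

# An oracle predictor that already knows the target bits, used only to
# show that the loop reassembles the token index.
target = token_to_bits(1234, 16)
recovered = progressive_unmask(16, 4, lambda state, pos: target[pos])
```

In this toy setting the head emits 16 binary decisions over 4 steps rather than one choice among 65,536 codebook entries, which suggests why a bit-level output can stay manageable as the codebook grows. The token count and codebook size are placeholders, not figures from the paper.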
