MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Abstract

Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to balance generation quality against representation learning and efficiency within a shared latent space. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ uses a token merge module placed after the self-attention blocks in the encoder to decouple top-k semantics from the latent space for subsequent Look-up Free Quantization (LFQ) and global alignment, and it recovers fine-grained details through cross-attention in the decoder for reconstruction. For second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ, as an AR generative model, achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at https://apexgen-x.github.io/MergeVQ.
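To make the quantization step more concrete, here is a minimal sketch of the general idea behind Look-up Free Quantization: each latent dimension is snapped to -1 or +1 by its sign, and the token index is read off as a binary code, so no nearest-neighbor search over a learned codebook is needed. This is an illustration under stated assumptions, not the MergeVQ implementation. The function names `lfq_quantize` and `lfq_decode`, the bit ordering, and the choice to map zero to -1 are all assumptions made for the example.

```python
from typing import List, Tuple


def lfq_quantize(z: List[float]) -> Tuple[List[int], int]:
    """Quantize one latent vector by sign (hypothetical helper, not from the paper).

    Each dimension becomes +1 if it is strictly positive, otherwise -1.
    The token index is the integer whose i-th bit is set when dimension i is +1.
    """
    codes = [1 if v > 0 else -1 for v in z]
    index = sum(1 << i for i, c in enumerate(codes) if c == 1)
    return codes, index


def lfq_decode(index: int, dim: int) -> List[int]:
    """Recover the +/-1 code vector from a token index without any table look-up."""
    return [1 if (index >> i) & 1 else -1 for i in range(dim)]


if __name__ == "__main__":
    latent = [0.7, -1.2, 0.1, -0.3]  # toy 4-dimensional latent
    codes, index = lfq_quantize(latent)
    print(codes, index)              # [1, -1, 1, -1] 5
    print(lfq_decode(index, len(latent)))
```

With a d-dimensional latent, this scheme yields an implicit vocabulary of 2^d tokens, which is why the index can be computed directly instead of searched for.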
