MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

April 1, 2025
Authors: Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei
cs.AI

Abstract

Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to balance generation quality against representation learning and efficiency within a shared latent space. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from the latent space using a token merge module placed after the self-attention blocks in the encoder, preparing them for subsequent Look-up Free Quantization (LFQ) and global alignment; the decoder then recovers their fine-grained details through cross-attention for reconstruction. For second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ, as an AR generative model, achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at https://apexgen-x.github.io/MergeVQ.
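
To make the Look-up Free Quantization (LFQ) step mentioned in the abstract more concrete, here is a minimal sketch of LFQ in its generic form from the literature: each latent dimension is binarized by its sign, and the resulting bit pattern serves directly as the token index, so no learned codebook lookup is needed. This is not the MergeVQ implementation; the function name `lfq_quantize`, the bit ordering, and the example vector are assumptions chosen for illustration.

```python
from typing import List, Tuple


def lfq_quantize(z: List[float]) -> Tuple[List[int], int]:
    """Quantize one latent vector without a codebook lookup.

    Hypothetical helper for illustration only. Each dimension is mapped
    to -1 or +1 by its sign, and the positive dimensions are read as set
    bits of an integer token index (dimension i contributes 2**i).
    """
    # Sign-based binarization: strictly positive values map to +1, others to -1.
    codes = [1 if v > 0 else -1 for v in z]
    # Interpret the +1 positions as set bits to obtain the token index.
    index = sum(1 << i for i, c in enumerate(codes) if c == 1)
    return codes, index


if __name__ == "__main__":
    codes, index = lfq_quantize([0.7, -1.2, 0.1, -0.4])
    print(codes, index)  # [1, -1, 1, -1] 5
```

With a d-dimensional latent, this scheme yields an implicit vocabulary of 2^d tokens, which is why the quantizer needs no embedding table.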
