

Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

March 13, 2026
Authors: Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize them within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced unified multimodal models (UMMs) in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on the popular GenEval and MMBench benchmarks while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.
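
The abstract's central mechanism, injecting semantically gated detail residuals into decoded semantics, can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' released code: the module name GatedDetailResidual, the projections detail_proj and gate_proj, and the 1024/256 dimensions are hypothetical.

    import torch
    import torch.nn as nn

    # Illustrative sketch only; names and dimensions are hypothetical, not from the paper's code release.
    class GatedDetailResidual(nn.Module):
        """Refines decoded semantics with high-frequency patch details via a learned gate."""

        def __init__(self, sem_dim: int, detail_dim: int):
            super().__init__()
            self.detail_proj = nn.Linear(detail_dim, sem_dim)  # map patch-level details into the semantic space
            self.gate_proj = nn.Linear(sem_dim, sem_dim)       # gate conditioned on the decoded semantics

        def forward(self, semantics: torch.Tensor, details: torch.Tensor) -> torch.Tensor:
            # semantics: (B, N, sem_dim), output of the first (semantic) decoding stage
            # details:   (B, N, detail_dim), patch detail residuals from the vision tokenizer
            gate = torch.sigmoid(self.gate_proj(semantics))       # per-channel gate in [0, 1]
            return semantics + gate * self.detail_proj(details)   # inject gated high-frequency content

    # Toy usage: with 4x token compression, e.g. 1024 image patches become 256 semantic tokens upstream;
    # here the 256 decoded semantic states are refined with their aligned detail residuals.
    refiner = GatedDetailResidual(sem_dim=1024, detail_dim=256)
    sem = torch.randn(2, 256, 1024)
    det = torch.randn(2, 256, 256)
    print(refiner(sem, det).shape)  # torch.Size([2, 256, 1024])

Under this reading, the gate keeps the semantic pathway stable for understanding while letting generation recover fine detail only where the semantics call for it, which is the decoupling the abstract describes.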