

Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

March 13, 2026
作者: Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

A recent frontier in multimodal modeling is unifying visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making joint optimization within a shared feature space non-trivial. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding while improving fidelity for image generation via gated detail residuals. Cheers comprises three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation with diffusion decoding for image generation, and (iii) a cascaded flow matching head that first decodes visual semantics and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced unified multimodal models (UMMs) in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on the GenEval and MMBench benchmarks while requiring only 20% of its training cost, indicating effective and efficient unified multimodal modeling. We will release all code and data for future research.
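The core decoupling idea can be illustrated with a toy sketch: the head first decodes from semantic tokens alone, then adds a patch-detail residual scaled by a gate computed from the semantics, so detail is injected only where the semantic content warrants it. This is a minimal NumPy illustration, not the paper's implementation; the gate parameterization (`W_g`), the additive refinement, and all shapes are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_gate(sem, detail, W_g):
    # Hypothetical gate: a sigmoid over a linear map of the semantic tokens
    # produces per-dimension weights in (0, 1) that scale the detail residual.
    g = 1.0 / (1.0 + np.exp(-(sem @ W_g)))
    return g * detail

# Toy shapes: 16 patch tokens, 32-dim features.
sem = rng.standard_normal((16, 32))     # semantic tokens from the vision tokenizer
detail = rng.standard_normal((16, 32))  # patch-level detail residual
W_g = rng.standard_normal((32, 32))     # assumed gating projection

coarse = sem                                         # stage 1: decode semantics only
refined = coarse + semantic_gate(sem, detail, W_g)   # stage 2: inject gated detail
assert refined.shape == (16, 32)
```

Because the gate saturates toward 0 where the semantic activation is strongly negative, the refined output stays close to the coarse semantic decode there, which is the intuition behind "stabilizing semantics" while still recovering high-frequency detail elsewhere.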