UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations
March 11, 2026
Authors: Yaqi Zhao, Wang Lin, Zijian Zhang, Miles Yang, Jingyuan Chen, Wentao Zhang, Zhao Zhong, Liefeng Bo
cs.AI
Abstract
Current unified multimodal models typically rely on discrete visual tokenizers to bridge the modality gap. However, discretization inevitably discards fine-grained semantic information, leading to suboptimal performance on visual understanding tasks. Conversely, directly modeling continuous semantic representations (e.g., CLIP, SigLIP) poses significant challenges for high-dimensional generative modeling, resulting in slow convergence and training instability. To resolve this dilemma, we introduce UniCom, a unified framework that harmonizes multimodal understanding and generation via compressed continuous representations. We empirically demonstrate that reducing the channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation. Accordingly, we design an attention-based semantic compressor that distills dense features into a compact unified representation. Furthermore, we validate that the Transfusion architecture surpasses query-based designs in convergence and consistency. Experiments demonstrate that UniCom achieves state-of-the-art generation performance among unified models. Notably, by preserving rich semantic priors, it delivers exceptional controllability in image editing, maintaining image consistency even without relying on a VAE.
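To make the abstract's central idea concrete, the following is a minimal PyTorch sketch of an attention-based channel compressor: it keeps the spatial token grid of a dense semantic feature map intact and reduces only the channel width. The module name, the SigLIP-like input width (1152), the output width (32), and the block depth are illustrative assumptions for this sketch, not UniCom's actual design.

```python
import torch
import torch.nn as nn

class AttentionChannelCompressor(nn.Module):
    # Sketch of an attention-based semantic compressor (assumed design):
    # dense features (B, N, in_dim) -> compact features (B, N, out_dim).
    # The token count N is preserved; only the channel dimension shrinks.
    def __init__(self, in_dim: int = 1152, out_dim: int = 32,
                 num_heads: int = 8, depth: int = 2):
        super().__init__()
        # Lightweight transformer blocks let tokens exchange information
        # before the lossy channel projection.
        layer = nn.TransformerEncoderLayer(
            d_model=in_dim, nhead=num_heads,
            dim_feedforward=4 * in_dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(in_dim)
        self.proj = nn.Linear(in_dim, out_dim)  # channel-reduction step

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: dense features from a frozen semantic encoder (e.g., SigLIP).
        return self.proj(self.norm(self.blocks(x)))

feats = torch.randn(2, 576, 1152)             # e.g., a 24x24 patch grid
z = AttentionChannelCompressor()(feats)       # -> shape (2, 576, 32)
```

Note how the spatial resolution (576 tokens) is untouched while the channel width is reduced, mirroring the abstract's finding that channel-dimension reduction is more effective than spatial downsampling.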