

LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

April 2, 2026
Authors: Jiachun Jin, Zetong Zhou, Xiao Yang, Hao Zhang, Pengfei Liu, Jun Zhu, Zhijie Deng
cs.AI

Abstract

Unified models (UMs) hold great promise for understanding and generating content across heterogeneous modalities. Compared to merely generating visual content, using UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling the visual dynamics of the physical world under stepwise action interventions. However, existing UMs rely on pixel decoding as a bridge because their visual representations for understanding and generation are disjoint, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, to push the limits of visual generation through self-reflection, and to support world modeling by predicting future visual states within the shared semantic latent space.
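To make the core idea concrete, the minimal PyTorch sketch below illustrates one plausible wiring of a shared-latent-space unified model: text tokens and semantic image features are projected into the same latent space, a single transformer backbone reasons over the interleaved sequence, and future visual states are predicted directly as latents rather than pixels. This is an illustrative reading of the abstract, not the authors' implementation; the class name `LatentUMSketch`, the layer sizes, and both prediction heads are assumptions.

```python
# Minimal sketch (illustrative, NOT the paper's implementation): a unified
# model whose transformer backbone consumes and produces tokens in ONE shared
# semantic latent space, so interleaved reasoning never round-trips through
# pixels. All names and dimensions here are hypothetical placeholders.

import torch
import torch.nn as nn


class LatentUMSketch(nn.Module):
    def __init__(self, d_model: int = 1024, n_layers: int = 12, n_heads: int = 16):
        super().__init__()
        # Shared backbone: one transformer stack serves all modalities.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Both modalities are projected into the SAME d_model latent space.
        self.text_embed = nn.Embedding(32_000, d_model)  # text tokens -> latents
        self.image_proj = nn.Linear(768, d_model)        # semantic image features -> latents
        # Two heads: next-text-token logits and next-image-latent prediction.
        self.text_head = nn.Linear(d_model, 32_000)
        self.latent_head = nn.Linear(d_model, d_model)

    def forward(self, text_ids: torch.Tensor, image_feats: torch.Tensor):
        # Interleave text latents and image latents in one sequence; the
        # backbone reasons over both without decoding images back to pixels.
        seq = torch.cat([self.text_embed(text_ids), self.image_proj(image_feats)], dim=1)
        h = self.backbone(seq)
        n_text = text_ids.shape[1]
        text_logits = self.text_head(h[:, :n_text])      # language prediction
        next_latents = self.latent_head(h[:, n_text:])   # predicted future visual
                                                         # state, still in latent space
        return text_logits, next_latents


# Example usage with random data:
# model = LatentUMSketch()
# ids = torch.randint(0, 32_000, (1, 8))   # 8 text tokens
# feats = torch.randn(1, 16, 768)          # 16 semantic image patches
# logits, latents = model(ids, feats)
```

Under this reading, pixel decoding becomes an optional final step (e.g., a separate decoder applied to predicted latents only when an actual image must be rendered), which is what removes the pixel-space round trip from intermediate reasoning steps such as visual self-reflection or future-state prediction.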