LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
April 2, 2026
Authors: Jiachun Jin, Zetong Zhou, Xiao Yang, Hao Zhang, Pengfei Liu, Jun Zhu, Zhijie Deng
cs.AI
Abstract
Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.
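To make the central idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of a unified model whose text tokens and visual latents live in one shared semantic space: intermediate "visual thoughts" are predicted directly as latents during interleaved reasoning, so no pixel decoding is needed to continue the chain. The class name `LatentUnifiedModel` and the dimensions `D`, `VOCAB`, and `N_VIS` are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a shared-latent-space unified model (assumptions noted above).
import torch
import torch.nn as nn

D = 512          # width of the shared semantic latent space (assumed)
VOCAB = 32000    # text vocabulary size (assumed)
N_VIS = 16       # number of latent tokens representing one image (assumed)


class LatentUnifiedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(VOCAB, D)    # text tokens -> shared space
        self.vision_proj = nn.Linear(D, D)          # image latents -> shared space
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.text_head = nn.Linear(D, VOCAB)        # read out the next text token
        self.latent_head = nn.Linear(D, D)          # read out the next visual latent

    def forward(self, text_ids, image_latents):
        # Interleave text and visual tokens in a single sequence; both are D-dim,
        # so the backbone treats them uniformly.
        seq = torch.cat([self.text_embed(text_ids),
                         self.vision_proj(image_latents)], dim=1)
        h = self.backbone(seq)
        # The final position can be decoded either as a text token or as a new
        # visual latent; pixels would only be rendered at the very end, if at all.
        return self.text_head(h[:, -1]), self.latent_head(h[:, -1])


# Toy usage: a prompt of 8 text tokens plus one image given as N_VIS semantic latents.
model = LatentUnifiedModel()
text_ids = torch.randint(0, VOCAB, (1, 8))
image_latents = torch.randn(1, N_VIS, D)  # e.g. from a semantic image encoder
next_text_logits, next_visual_latent = model(text_ids, image_latents)
print(next_text_logits.shape, next_visual_latent.shape)  # (1, 32000) (1, 512)
```

In this sketch, self-reflection or world modeling would amount to feeding the predicted visual latent back into the sequence and continuing generation, which is the efficiency argument the abstract makes against pixel-space mediation.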