PLUME: Latent Reasoning Based Universal Multimodal Embedding
April 2, 2026
Authors: Chenwei He, Xiangzhao Hao, Tianyu Yang, Yuxiang Ma, Yuheng Jia, Lingxiang Wu, Chaoyang Zhao, Haiyun Guo, Jinqiao Wang
cs.AI
Abstract
Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
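The core idea of the abstract — replacing token-by-token CoT decoding with a short autoregressive rollout of continuous latent states, steered by learned semantic anchors — can be sketched as follows. This is a minimal illustration only, assuming a PyTorch-style setup; the class name `LatentRollout`, the MLP transition, and the soft anchor selection are hypothetical stand-ins, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class LatentRollout(nn.Module):
    """Sketch of latent reasoning: instead of generating hundreds of CoT
    tokens, iterate a small transition network over a continuous hidden
    state for a fixed, small number of steps (the abstract reports <10)."""

    def __init__(self, hidden_dim: int, num_anchors: int, num_steps: int = 8):
        super().__init__()
        self.num_steps = num_steps
        # Hypothetical transition adapter: maps (state, anchor) -> state update.
        self.transition = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Learned semantic anchors that steer the rollout along different
        # reasoning trajectories under the same fixed computation budget.
        self.anchors = nn.Parameter(torch.randn(num_anchors, hidden_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim), e.g. the encoder's last hidden state.
        for _ in range(self.num_steps):
            # Soft-select an anchor by similarity to the current state.
            weights = torch.softmax(h @ self.anchors.T, dim=-1)  # (batch, num_anchors)
            anchor = weights @ self.anchors                      # (batch, hidden_dim)
            # One latent reasoning step, conditioned on the selected anchor.
            h = h + self.transition(torch.cat([h, anchor], dim=-1))
        return h  # final state is pooled into the retrieval embedding
```

Because the rollout length is a fixed constant rather than a variable-length generated rationale, inference cost is bounded and batch-friendly, which is the source of the reported speedup over explicit-CoT baselines.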