Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings
February 14, 2026
Authors: Haonan Jiang, Yuji Wang, Yongjie Zhu, Xin Lu, Wenyu Qin, Meng Wang, Pengfei Wan, Yansong Tang
cs.AI
Abstract
Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) across diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations compared to discriminative methods. However, the reasoning CoTs produced by existing generative embedding methods are limited to textual analysis of the query and are disconnected from retrieval of the target. To address these limitations, we propose a reasoning-driven UME framework that uses Embedder-Guided Reinforcement Learning (EG-RL) to optimize a Reasoner to produce evidential Traceability CoTs (T-CoT). Our key contributions are threefold: (1) We design an EG-RL framework in which the Embedder provides explicit supervision to the Reasoner, ensuring that the generated CoT traces are aligned with the embedding task. (2) We introduce T-CoT, which extracts critical multimodal cues to focus on retrieval-relevant elements and supplies multimodal inputs to the Embedder. (3) With limited computational resources, our framework outperforms pioneering embedding models on both the MMEB-V2 and UVRB benchmarks. Integrating multimodal evidence into structured reasoning, paired with retrieval-oriented alignment, effectively strengthens cross-modal semantic consistency and boosts the model's fine-grained matching capability as well as its generalization across complex scenarios. Our work demonstrates that targeted reasoning optimization can significantly improve multimodal embedding quality, providing a practical and efficient solution for reasoning-driven UME development.
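The abstract leaves the EG-RL training loop unspecified; the sketch below is a minimal, hypothetical illustration of the core idea, assuming a REINFORCE-style update in which the Embedder's query-target similarity, conditioned on the sampled T-CoT, serves as the Reasoner's reward. ToyReasoner, ToyEmbedder, sample_cot, and eg_rl_step are illustrative stand-ins, not the paper's actual interfaces.

```python
# Hypothetical sketch of Embedder-Guided RL (EG-RL): the Embedder scores how
# well a sampled chain-of-thought helps retrieve the target, and that score
# is used as a REINFORCE-style reward to update the Reasoner's policy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyReasoner(nn.Module):
    """Stand-in policy: samples a 'CoT' vector and reports its log-probability."""
    def __init__(self, dim=64):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.log_std = nn.Parameter(torch.zeros(dim))

    def sample_cot(self, query):
        dist = torch.distributions.Normal(self.mu(query), self.log_std.exp())
        cot = dist.sample()                       # non-differentiable sample
        return cot, dist.log_prob(cot).sum(-1)    # log-prob stays differentiable

class ToyEmbedder(nn.Module):
    """Stand-in embedder: fuses query and CoT; encodes the target separately."""
    def __init__(self, dim=64):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.tgt = nn.Linear(dim, dim)

    def encode(self, query, cot):
        return self.fuse(torch.cat([query, cot], dim=-1))

    def encode_target(self, target):
        return self.tgt(target)

def eg_rl_step(reasoner, embedder, query, target, optimizer):
    cot, log_prob = reasoner.sample_cot(query)
    q_emb = embedder.encode(query, cot)           # reasoning-augmented query
    t_emb = embedder.encode_target(target)
    # Reward: how well the CoT-conditioned query embedding matches the target.
    reward = F.cosine_similarity(q_emb, t_emb, dim=-1).detach()
    loss = -(reward * log_prob).mean()            # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    reasoner, embedder = ToyReasoner(), ToyEmbedder()
    opt = torch.optim.Adam(reasoner.parameters(), lr=1e-3)
    q, t = torch.randn(8, 64), torch.randn(8, 64)
    for step in range(3):
        print(f"step {step}: mean reward = {eg_rl_step(reasoner, embedder, q, t, opt):.4f}")
```

Detaching the reward keeps the Embedder fixed in this toy loop, so only the Reasoner is pushed toward retrieval-aligned reasoning, mirroring the abstract's description of the Embedder supervising the Reasoner rather than the reverse.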