
UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

November 1, 2025
Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
cs.AI

Abstract

The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework built on a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to produce both discriminative and generative embeddings; a subsequent reinforcement learning stage strengthens reasoning and further improves generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, with a combined oracle performance far exceeding that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings. Our code, models, and datasets will be publicly available at https://github.com/XMUDeepLIT/UME-R1.
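To make the pass@k and oracle-combination claims concrete, the following is a minimal, hypothetical sketch (not the authors' released code): given per-query success indicators from repeated generative-embedding samples and from a discriminative embedding, it computes coverage at k samples and the oracle score where a query counts as solved if either embedding type retrieves the correct target. Function names, the toy success rates, and the evaluation setup are illustrative assumptions.

```python
import numpy as np


def pass_at_k(success_matrix: np.ndarray, k: int) -> float:
    """Coverage: fraction of queries solved by at least one of the first k samples.

    success_matrix has shape (num_queries, num_samples); entry [i, j] is True
    if the j-th sampled generative embedding retrieved the correct target for
    query i. (Hypothetical helper, not the paper's evaluation code.)
    """
    assert 1 <= k <= success_matrix.shape[1]
    return float(success_matrix[:, :k].any(axis=1).mean())


def oracle_combination(disc_correct: np.ndarray, gen_correct: np.ndarray) -> float:
    """Oracle score: a query counts as solved if either the discriminative or
    the generative embedding retrieves the correct target."""
    return float((disc_correct | gen_correct).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy success indicators: 1000 queries, 8 generative-embedding samples each.
    gen_samples = rng.random((1000, 8)) < 0.60
    disc = rng.random(1000) < 0.55
    print("pass@1:", pass_at_k(gen_samples, 1))
    print("pass@8:", pass_at_k(gen_samples, 8))
    print("oracle(disc, gen@1):", oracle_combination(disc, gen_samples[:, 0]))
```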