Reasoning-Augmented Representations for Multimodal Retrieval
February 6, 2026
Authors: Jianrui Zhang, Anirudh Sundara Rajan, Brandon Han, Soochahn Lee, Sukanta Ganguly, Yong Jae Lee
cs.AI
Abstract
Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data-induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong Vision-Language Model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference-time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M-BEIR, our reasoning-augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge-intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.
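
To make the externalized-reasoning step concrete, here is a minimal sketch, not the authors' released implementation (see the repository above for that). It assumes a generic vlm callable standing in for the Vision-Language Model; the Entry dataclass, the prompt wording, and the "[visual evidence]" tag are all illustrative choices. Per the abstract, the augmented text would then be embedded by a retriever fine-tuned on such representations, so the embedding pass only compresses and the reasoning has already happened in text space.

    from dataclasses import dataclass
    from typing import Callable, Optional

    # Type of the externalized reasoner: (prompt, optional image path) -> text.
    VLM = Callable[[str, Optional[str]], str]

    @dataclass
    class Entry:
        """A corpus entry or query: free text plus optional visual evidence."""
        text: str
        image_path: Optional[str] = None

    def augment_corpus_entry(entry: Entry, vlm: VLM) -> Entry:
        """Densely caption the image so 'silent' visual evidence becomes
        explicit text that a single embedding pass can compress directly."""
        if entry.image_path is None:
            return entry
        caption = vlm(
            "Describe every retrieval-relevant detail in this image: "
            "objects, attributes, visible text, and relations.",
            entry.image_path,
        )
        return Entry(text=f"{entry.text}\n[visual evidence] {caption}",
                     image_path=entry.image_path)

    def augment_query(query: Entry, vlm: VLM) -> Entry:
        """Resolve ambiguous multimodal references and rewrite verbose
        instructions into concise retrieval constraints."""
        rewritten = vlm(
            "Rewrite this retrieval query so every reference to the attached "
            "image is explicit, as a short list of concrete constraints:\n"
            f"{query.text}",
            query.image_path,
        )
        return Entry(text=rewritten, image_path=query.image_path)

    if __name__ == "__main__":
        # Stand-in for a real VLM, just to show the data flow end to end.
        dummy_vlm: VLM = lambda prompt, image: "a red ceramic mug on a wooden desk"
        print(augment_query(Entry("make it blue", "mug.jpg"), dummy_vlm).text)

Note that both rewrites happen before retrieval: corpus entries can be augmented once offline, while queries are rewritten at search time, which is what allows the retriever itself to remain a single-pass embedding model.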