Reasoning-Augmented Representations for Multimodal Retrieval
February 6, 2026
Authors: Jianrui Zhang, Anirudh Sundara Rajan, Brandon Han, Soochahn Lee, Sukanta Ganguly, Yong Jae Lee
cs.AI
Abstract
Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data-induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong Vision-Language Model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference-time enhancement alone is insufficient; the retriever must be trained on these semantically dense representations to avoid distribution shift and fully exploit the added signal. Across M-BEIR, our reasoning-augmented training method yields consistent gains over strong baselines, with ablations showing that corpus enhancement chiefly benefits knowledge-intensive queries while query enhancement is critical for compositional modification requests. We publicly release our code at https://github.com/AugmentedRetrieval/ReasoningAugmentedRetrieval.
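To make the externalized-reasoning pipeline concrete, the Python sketch below shows one way the two augmentation steps described in the abstract (dense captioning of corpus images, and rewriting queries into concise retrieval constraints) could be wired together before embedding. All names here (`CorpusEntry`, `Query`, `enrich_corpus`, `rewrite_query`, the prompts, and the dummy VLM) are illustrative placeholders, not the authors' released implementation; see the linked repository for the actual code.

```python
"""Minimal sketch of reasoning-externalized data augmentation for retrieval.
Everything here is a hypothetical illustration of the idea, not the paper's code."""

from dataclasses import dataclass
from typing import Callable, List, Optional

# A "VLM" is modeled as any callable mapping (image_path, prompt) -> generated text.
VLM = Callable[[str, str], str]


@dataclass
class CorpusEntry:
    image_path: str          # path to the candidate image
    text: str                # text already attached to the entry
    enriched_text: str = ""  # dense caption added by the VLM


@dataclass
class Query:
    image_path: Optional[str]  # optional query image (multimodal query)
    instruction: str           # raw, possibly verbose user instruction
    rewritten: str = ""        # concise retrieval constraints after rewriting


def enrich_corpus(entries: List[CorpusEntry], vlm: VLM) -> List[CorpusEntry]:
    """Make 'silent' visual evidence explicit by densely captioning each corpus image."""
    prompt = "Describe all objects, attributes, text, and relations visible in this image."
    for entry in entries:
        caption = vlm(entry.image_path, prompt)
        entry.enriched_text = f"{entry.text}\n{caption}".strip()
    return entries


def rewrite_query(query: Query, vlm: VLM) -> Query:
    """Resolve references against the query image and compress the instruction
    into short, explicit retrieval constraints."""
    prompt = (
        "Rewrite the following request as concise retrieval constraints, "
        f"resolving any references to the attached image: {query.instruction}"
    )
    query.rewritten = vlm(query.image_path or "", prompt)
    return query


if __name__ == "__main__":
    # Dummy VLM so the sketch runs end to end without any model download.
    def dummy_vlm(image_path: str, prompt: str) -> str:
        return f"[VLM output for {image_path or 'no image'}]"

    corpus = enrich_corpus([CorpusEntry("cat.jpg", "A pet photo.")], dummy_vlm)
    query = rewrite_query(Query("query.jpg", "Find the same animal but sleeping"), dummy_vlm)
    print(corpus[0].enriched_text)
    print(query.rewritten)
```

Under this sketch, the retriever would then be fine-tuned on the `enriched_text` and `rewritten` fields rather than the raw inputs, matching the abstract's point that inference-time augmentation alone leaves a distribution mismatch.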