V-Retriever: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval
February 5, 2026
Authors: Dongyang Chen, Chaoyang Wang, Dezhao Su, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Ka
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retriever, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retriever enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (23.0% on average), perception-driven reasoning reliability, and generalization.
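To make the interleaved "hypothesis generation, then targeted visual verification" loop concrete, here is a minimal sketch of evidence-driven reranking, assuming a simplified setting. All names below (Candidate, propose_hypothesis, inspect_with_tool, rerank_with_evidence) and the scoring scheme are illustrative assumptions, not the paper's actual API; in particular, the lexical-overlap check stands in for a real external visual tool call (e.g., cropping and re-inspecting an image region with the MLLM).

```python
# Hypothetical sketch of evidence-driven reranking: alternate between a
# language-driven hypothesis and a targeted verification step, then rerank.
# Not the paper's implementation; all names and heuristics are placeholders.

from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    caption: str
    score: float = 0.0  # running relevance score updated by verification steps


def propose_hypothesis(query: str, cand: Candidate) -> str:
    """Language-driven step: state the visual claim that would settle relevance.

    A real agent would prompt the MLLM for this; a fixed template stands in here.
    """
    return f"image {cand.image_id} shows {query}"


def inspect_with_tool(cand: Candidate, hypothesis: str) -> bool:
    """Evidence step: stand-in for an external visual tool call (e.g. crop/zoom
    on a region and re-query the model). Here: toy content-word overlap between
    the hypothesis and the candidate's caption."""
    h = {w for w in hypothesis.lower().split() if len(w) > 3}
    c = {w for w in cand.caption.lower().split() if len(w) > 3}
    return len(h & c) >= 2


def rerank_with_evidence(query: str, candidates: list[Candidate],
                         max_rounds: int = 2) -> list[Candidate]:
    """Alternate hypothesis generation and targeted verification, then rerank."""
    for cand in candidates:
        for _ in range(max_rounds):
            hypothesis = propose_hypothesis(query, cand)
            if inspect_with_tool(cand, hypothesis):
                cand.score += 1.0   # verified evidence promotes the candidate
            else:
                cand.score -= 0.5   # failed check penalizes speculative matches
    return sorted(candidates, key=lambda c: c.score, reverse=True)


if __name__ == "__main__":
    cands = [
        Candidate("img_001", "a red bicycle leaning against a brick wall"),
        Candidate("img_002", "a blue car parked near a brick wall"),
    ]
    for c in rerank_with_evidence("red bicycle against a wall", cands):
        print(c.image_id, c.score)
```

The point of the loop structure is that ranking decisions are tied to per-candidate verification outcomes rather than a single static encoding pass, which is the property the abstract argues suppresses speculative matches in visually ambiguous cases.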