
DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

February 11, 2026
Authors: Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou
cs.AI

Abstract

Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.
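
The abstract names, but does not specify, the human-model collaborative pipeline used to build DISBench. The sketch below is one illustrative reading of that idea, not the paper's implementation: a vision-language model scans windows of a visual history and proposes candidate spatiotemporal associations, so that humans only verify candidates rather than discover context from scratch. All identifiers here (mine_candidate_queries, vlm_propose, the proposal dictionary fields) are hypothetical stand-ins.

```python
# Illustrative sketch (not the paper's code) of a human-model collaborative
# pipeline: a VLM mines candidate spatiotemporal associations so that human
# annotators verify, rather than author, context-dependent queries.
from typing import Callable, List


def mine_candidate_queries(
    history: List[str],                              # ordered image paths/IDs
    vlm_propose: Callable[[List[str]], List[dict]],  # assumed VLM wrapper
    window: int = 8,
    min_confidence: float = 0.7,
) -> List[dict]:
    """Slide a temporal window over the history and collect VLM-proposed
    associations, e.g. {"target": i, "cue_frames": [...], "query": "..."}."""
    candidates = []
    for start in range(max(1, len(history) - window + 1)):
        frames = history[start:start + window]
        for proposal in vlm_propose(frames):
            # Keep only confident proposals; the rest never reach a human.
            if proposal.get("confidence", 0.0) >= min_confidence:
                proposal["offset"] = start  # map back to global frame indices
                candidates.append(proposal)
    return candidates

# Humans then act as verifiers rather than authors, e.g.:
# verified = [c for c in mine_candidate_queries(...) if annotator_accepts(c)]
```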
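Likewise, the "modular agent framework equipped with fine-grained tools and a dual-memory system" is only named in the abstract. A minimal sketch of how such a long-horizon retrieval loop could look follows, assuming hypothetical names throughout (DualMemory, the vlm.plan interface, the tool set, and the action fields are all illustrative, not the paper's API).

```python
# Minimal sketch, under stated assumptions, of an agentic retrieval loop
# with a dual-memory system over a visual history.
from dataclasses import dataclass, field


@dataclass
class DualMemory:
    # Short-term scratchpad holding only the latest observation.
    working: list = field(default_factory=list)
    # Long-term store accumulating the evidence trail of the exploration.
    episodic: list = field(default_factory=list)


def agentic_search(query, history, vlm, tools, max_steps=20):
    """Iteratively explore a visual history until the target image is found."""
    memory = DualMemory()
    for _ in range(max_steps):
        # The VLM plans the next action from the query and both memories.
        action = vlm.plan(query=query, working=memory.working,
                          episodic=memory.episodic)
        if action.name == "answer":
            return action.argument  # index of the retrieved image
        # Fine-grained tools, e.g. zoom, temporal scan, neighbor lookup.
        observation = tools[action.name](history, action.argument)
        memory.working = [observation]                 # refresh short-term
        memory.episodic.append((action, observation))  # accumulate evidence
    return None  # step budget exhausted without an answer
```

The split here follows a common agent design: working memory keeps the planning prompt short, while episodic memory preserves the evidence needed for long-horizon navigation. Whether DISBench's baseline partitions its two memories exactly this way is not stated in the abstract.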