

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

February 11, 2026
Authors: Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou
cs.AI

Abstract

Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.
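The abstract describes a modular agent that plans multi-step tool calls over a raw visual history and maintains a dual-memory record of inspected evidence. The snippet below is a minimal, hypothetical sketch of such a loop under assumed interfaces: the tool names, the `DualMemory` class, and the planner API are illustrative and are not taken from the paper.

```python
# Minimal sketch of an agentic retrieval loop over a visual history.
# All tool names, memory structures, and the planner interface here are
# illustrative assumptions; the paper's actual framework may differ.
from dataclasses import dataclass, field


@dataclass
class DualMemory:
    """Hypothetical dual-memory store: a short working buffer for the
    current reasoning step plus a long-term log of visited evidence."""
    working: list = field(default_factory=list)   # recent observations
    episodic: dict = field(default_factory=dict)  # image_id -> notes

    def record(self, image_id, note):
        self.working.append((image_id, note))
        self.episodic[image_id] = note
        self.working = self.working[-8:]  # keep the working buffer short


def agentic_image_search(query, visual_history, planner, tools, max_steps=20):
    """Iteratively plan fine-grained tool calls (e.g. inspect a frame,
    compare two frames) until the planner commits to a target image id."""
    memory = DualMemory()
    for _ in range(max_steps):
        action = planner.plan(query, memory.working, memory.episodic)
        if action.name == "answer":                 # planner is confident
            return action.argument                   # predicted image id
        observation = tools[action.name](visual_history, action.argument)
        memory.record(action.argument, observation)
    return None  # step budget exhausted without locating the target
```

The separation into a bounded working buffer and an unbounded episodic log is one plausible way to keep long-horizon navigation tractable: the planner sees a compact recent context at every step while still being able to look up earlier evidence by image id.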