

Towards Long-horizon Agentic Multimodal Search

April 14, 2026
Authors: Yifan Du, Zikang Liu, Jinbiao Peng, Jie Wu, Junyi Li, Jinyang Li, Wayne Xin Zhao, Ji-Rong Wen
cs.AI

Abstract

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released at https://github.com/RUCAIBox/LMM-Searcher.
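The core mechanism the abstract describes (offloading visual assets to a file system, replacing them in the agent's context with lightweight textual UIDs, and reloading pixels on demand via a fetch-image tool) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the class and method names (`VisualStore`, `offload`, `fetch_image`) are assumptions for illustration.

```python
import uuid
from pathlib import Path


class VisualStore:
    """Sketch of a file-based visual representation: images live on disk,
    and only short textual identifiers (UIDs) enter the agent's context."""

    def __init__(self, root: str = "visual_store"):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self._index: dict[str, Path] = {}  # UID -> file path

    def offload(self, image_bytes: bytes) -> str:
        """Persist an image and return its UID placeholder.

        Only the returned token is appended to the conversation context,
        which keeps per-turn token cost roughly constant regardless of
        how many images the agent has encountered."""
        uid = f"IMG-{uuid.uuid4().hex[:8]}"
        path = self.root / f"{uid}.png"
        path.write_bytes(image_bytes)
        self._index[uid] = path
        return uid

    def fetch_image(self, uid: str) -> bytes:
        """On-demand loading: reload the actual pixels only when the
        agent explicitly requests this UID (the fetch-image tool)."""
        return self._index[uid].read_bytes()
```

Under this scheme the context grows by one short token per image rather than by thousands of vision tokens, which is what allows the search horizon to stretch to many turns without context explosion.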