
OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

May 6, 2026
作者: Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai, Quanxin Shou, Yunlong Lin, Xiangyu Yue, Shenghua Gao, Tianyu Pang
cs.AI

Abstract

Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high-quality training data, transparent trajectory-synthesis pipelines, and detailed training recipes. To this end, we introduce OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep-search agents with agentic reinforcement learning. First, we curate a dedicated data pipeline that constructs high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding, which jointly reduce data shortcuts and one-step retrieval collapse. From this pipeline, we build two training datasets: SearchVL-SFT-36k for supervised fine-tuning and SearchVL-RL-8k for reinforcement learning. In addition, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. Built on this recipe, OpenSearch-VL delivers substantial performance gains, with average improvements of over 10 points across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep-search agents.
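The fatal-aware credit assignment described above can be illustrated with a minimal sketch. This is an assumption-laden reading of the abstract, not the paper's actual implementation: it assumes a standard GRPO group-normalized advantage, a per-token loss mask that zeroes out everything after the first fatal tool failure, and a one-sided clamp that prevents pre-failure tokens from being penalized for the downstream failure. The function name `fatal_aware_grpo` and its interface are hypothetical.

```python
import numpy as np

def fatal_aware_grpo(rewards, fatal_steps, num_tokens):
    """Hypothetical sketch of multi-turn fatal-aware GRPO credit assignment.

    rewards:      per-trajectory scalar rewards for one GRPO rollout group
    fatal_steps:  token index of the first fatal tool failure in each
                  trajectory, or None if the rollout completed cleanly
    num_tokens:   length of each trajectory in tokens
    Returns per-token (advantage, loss_mask) arrays, one pair per trajectory.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Group-relative normalized advantage, as in standard GRPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    per_token_adv, masks = [], []
    for a, fatal, n in zip(adv, fatal_steps, num_tokens):
        mask = np.ones(n)
        if fatal is not None:
            # Mask post-failure tokens: the cascading-error suffix
            # contributes no gradient.
            mask[fatal:] = 0.0
            # One-sided clamp: pre-failure reasoning keeps any positive
            # signal but is never punished for the later failure.
            a = max(a, 0.0)
        per_token_adv.append(np.full(n, a) * mask)
        masks.append(mask)
    return per_token_adv, masks
```

Under this reading, a clean trajectory is trained on normally, while a trajectory that hit a fatal tool failure only ever receives non-negative advantage on its pre-failure tokens and zero gradient afterward.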