

OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

May 6, 2026
Authors: Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai, Quanxin Shou, Yunlong Lin, Xiangyu Yue, Shenghua Gao, Tianyu Pang
cs.AI

Abstract

Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high-quality training data, transparent trajectory-synthesis pipelines, and detailed training recipes. To this end, we introduce OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents with agentic reinforcement learning. First, we design a dedicated pipeline that constructs high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding, which jointly reduce shortcut learning and one-step retrieval collapse. From this pipeline we curate two training datasets: SearchVL-SFT-36k for SFT and SearchVL-RL-8k for RL. In addition, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. Built on this recipe, OpenSearch-VL delivers substantial performance gains, with average improvements of over 10 points across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep search agents.
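The fatal-aware GRPO mechanism can be illustrated with a minimal sketch. The abstract specifies two components: (1) tokens after a fatal tool failure are masked out of the loss, and (2) a one-sided advantage clamp prevents failure-driven negative advantages from erasing useful pre-failure reasoning. The function below is a hypothetical rendering under those assumptions; the function name, the `clamp_floor` parameter, and the exact clamping formulation are illustrative choices, not the authors' released implementation.

```python
import numpy as np

def fatal_aware_grpo_advantages(rewards, seq_lens, fatal_steps, clamp_floor=-0.5):
    """Sketch of multi-turn fatal-aware GRPO advantage computation.

    rewards:     scalar reward per trajectory in one sampled group
    seq_lens:    token count of each trajectory
    fatal_steps: token index of the first fatal tool failure per
                 trajectory, or None if no failure occurred
    Returns per-token advantages and a per-token loss mask.
    """
    r = np.asarray(rewards, dtype=np.float64)
    # GRPO: group-relative advantage, i.e. reward standardized within the group.
    group_adv = (r - r.mean()) / (r.std() + 1e-8)

    advantages, masks = [], []
    for a, n, fatal in zip(group_adv, seq_lens, fatal_steps):
        token_adv = np.full(n, a)
        mask = np.ones(n)
        if fatal is not None:
            # Mask post-failure tokens so cascading tool errors
            # contribute no gradient signal.
            mask[fatal:] = 0.0
            # One-sided clamp: bound the negative advantage from below so
            # pre-failure reasoning is not punished wholesale for a
            # downstream tool failure. Positive advantages pass unchanged.
            token_adv = np.maximum(token_adv, clamp_floor)
        advantages.append(token_adv)
        masks.append(mask)
    return advantages, masks
```

For example, in a group of two trajectories where the second one hits a fatal tool failure at token 2, all of its tokens from index 2 onward are masked, and its (negative) group-relative advantage is floored at `clamp_floor` rather than applied in full to the surviving pre-failure tokens.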