

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

May 11, 2026
作者: Shijue Huang, Hangyu Guo, Chenxin Li, Junting Lu, Xinyu Geng, Zhaochen Su, Zhenyu Li, Shuang Chen, Hongru Wang, Yi R. Fung
cs.AI

Abstract

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round's data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro (37.9%) in the standard agent-workflow setting. At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.
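The core idea of the image bank reference protocol — registering each tool-returned image under a stable, addressable reference so later tools can re-consume it — can be sketched as below. This is a minimal illustration, not the paper's actual implementation: the names `ImageBank`, `register`, and `resolve` are hypothetical, and a real harness would also expose these references in the agent's tool-call schema.

```python
from dataclasses import dataclass, field

@dataclass
class ImageBank:
    """Registry assigning an addressable reference to every tool-returned image."""
    _images: dict = field(default_factory=dict)
    _counter: int = 0

    def register(self, image_bytes: bytes, source_tool: str) -> str:
        """Store an image and return a reference id that later tool calls can cite."""
        ref = f"img_{self._counter}"
        self._counter += 1
        self._images[ref] = {"data": image_bytes, "source": source_tool}
        return ref

    def resolve(self, ref: str) -> bytes:
        """Look up a previously registered image so a later tool can reuse it."""
        return self._images[ref]["data"]

# Example: a search tool returns an image; a later crop/zoom tool reuses it by
# reference instead of treating it as a transient output.
bank = ImageBank()
ref = bank.register(b"<png bytes>", source_tool="web_search")
assert bank.resolve(ref) == b"<png bytes>"
```

The design point is that intermediate visual evidence becomes first-class state: any tool downstream can address `img_0` again, enabling the iterative visual refinement the analyses highlight.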