
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

May 11, 2026
作者: Shijue Huang, Hangyu Guo, Chenxin Li, Junting Lu, Xinyu Geng, Zhaochen Su, Zhenyu Li, Shuang Chen, Hongru Wang, Yi R. Fung
cs.AI

Abstract

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image-bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round's data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro (37.9%) under the standard agent-workflow setting. At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.
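The image-bank reference protocol described above can be sketched as a minimal registry: every image a tool returns is stored under a stable, addressable ID that later tools can resolve instead of treating the image as a transient output. The class and method names below (`ImageBank`, `register`, `resolve`) are illustrative assumptions, not identifiers from the paper:

```python
import itertools


class ImageBank:
    """Minimal sketch of an image-bank reference protocol.

    Every tool-returned image is registered under an addressable
    reference so later tools can re-consume intermediate visual
    evidence. Names here are illustrative, not from the paper.
    """

    def __init__(self):
        self._counter = itertools.count(1)
        self._images = {}  # reference id -> (image payload, source tool)

    def register(self, image_bytes, source_tool):
        """Store a tool-returned image; hand back an addressable reference."""
        ref = f"img:{next(self._counter)}"
        self._images[ref] = (image_bytes, source_tool)
        return ref

    def resolve(self, ref):
        """Let a later tool look up an earlier image by its reference."""
        image_bytes, _source = self._images[ref]
        return image_bytes


# A search tool returns a screenshot and receives a stable reference...
bank = ImageBank()
ref = bank.register(b"<png bytes>", source_tool="web_search")
# ...which a later crop/zoom/OCR tool can resolve instead of re-fetching.
assert bank.resolve(ref) == b"<png bytes>"
```

The design point is simply that references, not raw images, flow between tool calls, so the same intermediate evidence can be revisited across an arbitrarily long trajectory.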
PDF · May 14, 2026