DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
March 10, 2026
Authors: Aili Chen, Chi Zhang, Junteng Liu, Jiangjie Chen, Chengyu Du, Yunji Li, Ming Zhong, Qin Wang, Zhengmao Zhu, Jiayuan Song, Ke Ji, Junxian He, Pengyu Zhao, Yanghua Xiao
cs.AI
Abstract
Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks. Scaling diversity is difficult because training requires tasks to remain executable and verifiable, while generalization demands coverage of diverse tool types, toolset combinations, and heterogeneous tool-use patterns. We propose DIVE, an evidence-driven recipe that inverts the synthesis order: it executes diverse, real-world tools first and reverse-derives tasks strictly entailed by the resulting traces, thereby providing grounding by construction. DIVE scales structural diversity along two controllable axes, tool-pool coverage and per-task toolset variety, and an Evidence Collection–Task Derivation loop further induces rich multi-step tool-use patterns across 373 tools in five domains. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) yields a +22-point average improvement across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68. Remarkably, controlled scaling analysis reveals that diversity scaling consistently outperforms quantity scaling for OOD generalization, even with 4x less data.
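The inverted synthesis order the abstract describes — execute tools first, then derive a task that the recorded trace strictly entails — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tool pool, the trace format, and the `derive_task` template are all hypothetical stand-ins (a real system would sample from hundreds of tools and prompt an LLM for derivation).

```python
import random

# Hypothetical two-tool pool; DIVE uses 373 real-world tools in five domains.
TOOL_POOL = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
    "convert_units": lambda c: {"temp_f": c * 9 / 5 + 32},
}

def collect_evidence(toolset, seed=0):
    """Evidence Collection: execute the sampled tools and record a trace
    of (tool_name, arguments, observation) steps."""
    rng = random.Random(seed)
    trace = []
    city = rng.choice(["Paris", "Tokyo"])          # diversity in arguments
    obs = toolset["get_weather"](city)
    trace.append(("get_weather", {"city": city}, obs))
    obs2 = toolset["convert_units"](obs["temp_c"])  # multi-step pattern
    trace.append(("convert_units", {"c": obs["temp_c"]}, obs2))
    return trace

def derive_task(trace):
    """Task Derivation: reverse-derive a task whose answer is strictly
    entailed by the trace, so it is verifiable by construction.
    A real system would prompt an LLM here; this uses a fixed template."""
    city = trace[0][1]["city"]
    answer = trace[-1][2]["temp_f"]
    question = f"What is the current temperature in {city}, in Fahrenheit?"
    return {"task": question, "verifiable_answer": answer, "trace": trace}

trace = collect_evidence(TOOL_POOL, seed=1)
sample = derive_task(trace)
print(sample["task"], "->", sample["verifiable_answer"])
```

Because the task is written only after execution, every synthesized sample ships with a ground-truth trace and answer, which is what makes it usable for both SFT and verifiable RL rewards.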