DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
March 10, 2026
Authors: Aili Chen, Chi Zhang, Junteng Liu, Jiangjie Chen, Chengyu Du, Yunji Li, Ming Zhong, Qin Wang, Zhengmao Zhu, Jiayuan Song, Ke Ji, Junxian He, Pengyu Zhao, Yanghua Xiao
cs.AI
Abstract
Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks. Scaling diversity is difficult because training requires tasks to remain executable and verifiable, while generalization demands coverage of diverse tool types, toolset combinations, and heterogeneous tool-use patterns. We propose DIVE, an evidence-driven recipe that inverts synthesis order, executing diverse, real-world tools first and reverse-deriving tasks strictly entailed by the resulting traces, thereby providing grounding by construction. DIVE scales structural diversity along two controllable axes, tool-pool coverage and per-task toolset variety, and an Evidence Collection–Task Derivation loop further induces rich multi-step tool-use patterns across 373 tools in five domains. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves by +22 average points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68. Remarkably, controlled scaling analysis reveals that diversity scaling consistently outperforms quantity scaling for OOD generalization, even with 4x less data.
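The inverted synthesis order described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: every name here (`Tool`, `derive_task_from_trace`, `synthesize`) is hypothetical, and the "derivation" step is a stand-in for whatever model-based reverse-derivation DIVE actually uses. The two controllable diversity axes appear as the breadth of the sampled pool (tool-pool coverage) and the varying number of tools per task (per-task toolset variety); tasks are entailed by real execution traces, so they are verifiable by construction.

```python
import random
from dataclasses import dataclass


@dataclass
class Tool:
    """Hypothetical stand-in for a real, executable tool."""
    name: str

    def execute(self, arg: str) -> str:
        # A real tool would call an API; here we just record the call.
        return f"{self.name}({arg})"


def derive_task_from_trace(trace):
    # Hypothetical derivation step: package the trace as a
    # (question, verifiable answer) pair entailed by the evidence.
    steps = " -> ".join(name for name, _ in trace)
    return {"question": f"Complete the workflow: {steps}",
            "answer": trace[-1][1]}


def synthesize(tool_pool, n_tasks, max_k=3, seed=0):
    rng = random.Random(seed)
    tasks = []
    for i in range(n_tasks):
        # Axis 2: per-task toolset variety -- vary how many tools each task uses.
        k = rng.randint(1, min(max_k, len(tool_pool)))
        # Axis 1: tool-pool coverage -- sample across the whole pool.
        toolset = rng.sample(tool_pool, k)

        # Evidence collection: execute tools FIRST, chaining outputs into a trace.
        out, trace = f"input{i}", []
        for tool in toolset:
            out = tool.execute(out)
            trace.append((tool.name, out))

        # Task derivation: the task is strictly entailed by the recorded trace.
        tasks.append(derive_task_from_trace(trace))
    return tasks


pool = [Tool("search"), Tool("calculator"), Tool("translate")]
tasks = synthesize(pool, n_tasks=2)
```

Each synthesized task carries its own ground-truth answer from the trace, which is what makes the data executable and verifiable for SFT/RL post-training.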