Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text
January 15, 2026
Authors: Zhihao Xu, Rumei Li, Jiahuan Li, Rongxiang Weng, Jingang Wang, Xunliang Cai, Xiting Wang
cs.AI
Abstract
Enabling Large Language Models (LLMs) to effectively use tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm: we observe that textual corpora naturally contain rich multi-step problem-solving experience, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Building on this insight, we introduce GEM, a data synthesis pipeline that generates and extracts multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow & tool extraction, trajectory grounding, and complexity refinement. To reduce computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning, distilling the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieves a 16.5% improvement on the BFCL V3 Multi-turn benchmark. In some scenarios, our models even surpass models trained on in-domain τ-bench (Airline and Retail) data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and cost.
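To make the four-stage process concrete, the following is a minimal, hypothetical sketch of how such a pipeline could be organized. It is not the authors' implementation: the stage names follow the abstract, but every function signature, prompt, and the `call_llm` placeholder are illustrative assumptions.

```python
# Hypothetical sketch of a GEM-style four-stage synthesis pipeline.
# Stage names follow the abstract; all signatures, prompts, and the
# `call_llm` placeholder are assumptions for illustration only.

from dataclasses import dataclass, field


@dataclass
class Trajectory:
    source_text: str
    tools: list = field(default_factory=list)   # extracted tool descriptions
    turns: list = field(default_factory=list)   # multi-turn tool-use messages


def call_llm(prompt: str) -> str:
    # Placeholder: swap in any real LLM client. Returns a canned answer
    # so the sketch runs end to end without external dependencies.
    return "yes: search(query); book_flight(origin, destination)"


def relevance_filter(doc: str) -> bool:
    """Stage 1: keep only documents describing multi-step problem solving."""
    answer = call_llm(f"Does this text describe a multi-step task? {doc}")
    return answer.lower().startswith("yes")


def extract_workflow_and_tools(doc: str) -> Trajectory:
    """Stage 2: abstract the text into a workflow plus implied tool definitions."""
    tools = call_llm(f"List the tools (name, arguments) implied by: {doc}")
    return Trajectory(source_text=doc, tools=[tools])


def ground_trajectory(traj: Trajectory) -> Trajectory:
    """Stage 3: ground the workflow in concrete multi-turn tool-call messages."""
    traj.turns = [call_llm(f"Write a multi-turn tool-use dialogue using: {traj.tools}")]
    return traj


def refine_complexity(traj: Trajectory) -> Trajectory:
    """Stage 4: raise task difficulty, e.g. add a dependent follow-up request."""
    traj.turns.append(call_llm("Add a follow-up request that depends on earlier results."))
    return traj


def gem_pipeline(corpus: list[str]) -> list[Trajectory]:
    """Run all four stages over a text corpus and collect synthesized trajectories."""
    results = []
    for doc in corpus:
        if relevance_filter(doc):
            traj = extract_workflow_and_tools(doc)
            results.append(refine_complexity(ground_trajectory(traj)))
    return results


if __name__ == "__main__":
    demo_corpus = ["To rebook a flight, first search for alternatives, then confirm with the passenger."]
    print(gem_pipeline(demo_corpus))
```

Under this reading, the Trajectory Synthesizer mentioned in the abstract would be a single fine-tuned model trained on the pipeline's outputs, replacing the repeated `call_llm` stages with one end-to-end generation pass.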