ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas
January 29, 2026
Authors: Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Hao Zhou, Kaichi Yu, Yudian Zhang, Jade Ouyang, Junxi Yin, Jiong Chen, Baoyan Guo, Lei Zhang, Junjie Tao, Yuansheng Song, Ming Cui, Chengwei Liu
cs.AI
Abstract
Large language models (LLMs) are increasingly deployed as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a data pipeline leverages the static topology of tool-call graphs to synthesize diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework captures the rich, compositional topology of human semantic reasoning, converting decomposed question-answer traces into independent, code-executable, and rule-verifiable environments that enable deterministic multi-turn RL. On top of these components, we develop a unified training recipe that integrates SFT with online RL using trajectory-level rewards to balance task completion against interaction efficiency. Experiments on multiple agentic tool-use benchmarks show that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.
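To make two of the abstract's mechanisms concrete, below is a minimal Python sketch (our illustration, not code from the ASTRA repository) of a rule-verifiable environment whose final state can be checked deterministically in code, paired with a trajectory-level reward that trades off task completion against interaction efficiency. All class and function names, the toy task, and the specific linear turn penalty are illustrative assumptions.

```python
# Minimal sketch, assuming a toy task: the agent must drive `state` to
# `target` via tool calls. Names and reward form are hypothetical.
from dataclasses import dataclass


@dataclass
class VerifiableEnv:
    """Toy multi-turn environment with a deterministic, rule-based check."""
    target: int
    state: int = 0
    turns: int = 0
    max_turns: int = 8

    def step(self, delta: int) -> bool:
        """Apply one tool call; return True while the episode is still open."""
        self.state += delta
        self.turns += 1
        return self.turns < self.max_turns and self.state != self.target

    def verify(self) -> bool:
        """Deterministic rule verifying the final state (no LLM judge)."""
        return self.state == self.target


def trajectory_reward(env: VerifiableEnv, efficiency_weight: float = 0.1) -> float:
    """Trajectory-level reward: 1.0 for verified completion, minus a small
    per-turn penalty so shorter successful trajectories score higher."""
    completion = 1.0 if env.verify() else 0.0
    return completion - efficiency_weight * env.turns


if __name__ == "__main__":
    env = VerifiableEnv(target=3)
    for delta in (1, 1, 1):  # a 3-turn trajectory that reaches the target
        env.step(delta)
    print(trajectory_reward(env))  # 1.0 - 0.1 * 3 = 0.7
```

In ASTRA itself, such environments and their verification rules are synthesized automatically from decomposed question-answer traces rather than hand-written as above; the sketch only fixes the interface shape that "code-executable, rule-verifiable, trajectory-level reward" implies.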