
ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

January 29, 2026
Authors: Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Hao Zhou, Kaichi Yu, Yudian Zhang, Jade Ouyang, Junxi Yin, Jiong Chen, Baoyan Guo, Lei Zhang, Junjie Tao, Yuansheng Song, Ming Cui, Chengwei Liu
cs.AI

Abstract

Large language models (LLMs) are increasingly deployed as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a trajectory-synthesis pipeline exploits the static topology of tool-call graphs to generate diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment-synthesis framework captures the rich, compositional topology of human semantic reasoning, converting decomposed question-answer traces into independent, code-executable, rule-verifiable environments that enable deterministic multi-turn RL. On top of these components, we develop a unified training recipe that integrates SFT with online RL, using trajectory-level rewards to balance task completion against interaction efficiency. Experiments on multiple agentic tool-use benchmarks show that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.
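
To make the abstract's "code-executable, rule-verifiable environments" concrete, the sketch below shows one way such a deterministic multi-turn environment could look. Everything here (the `ToolEnv` class, the `TOOLS` registry, the gym-style `step` interface, the exact-match rule) is a hypothetical illustration under assumed conventions, not ASTRA's actual API:

```python
from dataclasses import dataclass

# Hypothetical tool registry: each tool is a deterministic Python function,
# so every rollout replays identically (deterministic multi-turn RL).
TOOLS = {
    "lookup_capital": lambda country: {"France": "Paris"}.get(country, "unknown"),
}

@dataclass
class ToolEnv:
    goal: str           # ground-truth answer from a decomposed QA trace
    max_turns: int = 8
    turn: int = 0

    def step(self, tool_name: str, **args):
        """Execute one tool call; verify the observation with a plain rule."""
        self.turn += 1
        obs = TOOLS[tool_name](**args)     # real code execution, no simulation
        solved = obs == self.goal          # rule check, no LLM-as-judge
        done = solved or self.turn >= self.max_turns
        return obs, 1.0 if solved else 0.0, done

env = ToolEnv(goal="Paris")
print(env.step("lookup_capital", country="France"))  # ('Paris', 1.0, True)
```

The point of the rule-based check is that the reward signal is exactly reproducible across rollouts, which is what allows stable multi-turn RL without a learned or simulated judge.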
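Similarly, the "trajectory-level rewards" that balance task completion against interaction efficiency could take a shape like the following; the per-turn penalty and the `trajectory_reward` function are assumptions for illustration, not the paper's exact formulation:

```python
# Hypothetical per-turn cost; the paper's actual trade-off may differ.
LENGTH_PENALTY = 0.05

def trajectory_reward(solved: bool, num_turns: int) -> float:
    """Score one whole trajectory: full credit only if the task is solved,
    discounted by how many interaction turns the agent used."""
    if not solved:
        return 0.0
    # Fewer turns -> higher reward, floored so success always beats failure.
    return max(1.0 - LENGTH_PENALTY * (num_turns - 1), 0.1)

print(trajectory_reward(True, 3))    # 0.9
print(trajectory_reward(False, 8))   # 0.0
```

Scoring the trajectory as a whole, rather than each turn in isolation, lets a single verifiable outcome signal shape both what the agent accomplishes and how economically it interacts.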