ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Abstract

Large language models (LLMs) are increasingly deployed as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle to learn stably over long-horizon, multi-turn interactions. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Building on these two components, we develop a unified training methodology that integrates SFT with online RL, using trajectory-level rewards to balance task completion against interaction efficiency. Experiments on multiple agentic tool-use benchmarks show that ASTRA-trained models achieve state-of-the-art performance among models of comparable scale, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.
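To make the idea of a trajectory-level reward that trades off task completion against interaction efficiency more concrete, the following is a minimal illustrative sketch. It is not the reward used in ASTRA: the abstract does not state the formula, so the `Trajectory` fields, the `efficiency_weight` parameter, and the multiplicative form are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One multi-turn rollout. Hypothetical structure, not taken from the paper."""
    subgoals_total: int     # rule-checkable sub-questions in the environment
    subgoals_passed: int    # sub-questions accepted by the rule verifier
    tool_calls: int         # tool calls the agent actually issued
    tool_calls_budget: int  # assumed reference number of calls for the task


def trajectory_reward(traj: Trajectory, efficiency_weight: float = 0.2) -> float:
    """Return one scalar reward in [0, 1] for the whole trajectory.

    Assumed form: completion * ((1 - w) + w * efficiency), so efficiency
    only earns reward in proportion to how much of the task was solved.
    """
    if traj.subgoals_total <= 0 or traj.tool_calls_budget <= 0:
        raise ValueError("subgoals_total and tool_calls_budget must be positive")
    if not 0.0 <= efficiency_weight <= 1.0:
        raise ValueError("efficiency_weight must lie in [0, 1]")

    # Fraction of verifiable sub-goals that passed the rule checks.
    completion = traj.subgoals_passed / traj.subgoals_total

    # 1.0 when the agent stays within budget; shrinks as extra calls pile up.
    efficiency = min(1.0, traj.tool_calls_budget / max(traj.tool_calls, 1))

    return completion * ((1.0 - efficiency_weight) + efficiency_weight * efficiency)


if __name__ == "__main__":
    solved_fast = Trajectory(subgoals_total=4, subgoals_passed=4,
                             tool_calls=4, tool_calls_budget=4)
    solved_slow = Trajectory(subgoals_total=4, subgoals_passed=4,
                             tool_calls=8, tool_calls_budget=4)
    print(trajectory_reward(solved_fast), trajectory_reward(solved_slow))
```

Under these assumptions, a trajectory that solves nothing receives zero reward regardless of how few calls it makes, and a fully solved task is penalized only mildly for exceeding the call budget. Other formulations, such as an additive penalty per turn, would fit the description equally well.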
