EnvFactory: Scaling Tool-Use Agents via Executable Environment Synthesis and Robust Reinforcement Learning

Abstract

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments, and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches rely on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or tied to pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, which reduces their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using far fewer environments than prior work, which often uses five times as many, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including τ²-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.