In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Abstract

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
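The credit-assignment idea described above (one verifiable trajectory-level outcome broadcast to every turn, then normalized within a group of rollouts) can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes the common GRPO-style normalization, (reward - group mean) / group standard deviation. The function name `broadcast_group_advantages`, its parameters, and the `eps` constant are hypothetical and chosen only for illustration; they are not taken from the AgentFlow code.

```python
from statistics import mean, pstdev


def broadcast_group_advantages(outcomes, turns_per_rollout, eps=1e-8):
    """Return one advantage per planner turn for each rollout in a group.

    outcomes: final (trajectory-level) reward of each rollout, e.g. 1.0 or 0.0.
    turns_per_rollout: number of planner turns taken in each rollout.
    Assumption: advantages are normalized with the group's mean and
    population standard deviation, as in GRPO-style methods.
    """
    if len(outcomes) != len(turns_per_rollout):
        raise ValueError("need one turn count per rollout")

    mu = mean(outcomes)
    sigma = pstdev(outcomes)

    advantages = []
    for reward, n_turns in zip(outcomes, turns_per_rollout):
        # The same normalized outcome is credited to every turn of the rollout,
        # so each turn can be treated as its own single-turn policy update.
        adv = (reward - mu) / (sigma + eps)
        advantages.append([adv] * n_turns)
    return advantages


# Example group: four rollouts of the same query with different lengths.
group_outcomes = [1.0, 0.0, 1.0, 0.0]
group_turns = [3, 2, 4, 1]
group_advantages = broadcast_group_advantages(group_outcomes, group_turns)
```

In this sketch, every turn of a successful rollout receives the same positive advantage and every turn of a failed rollout the same negative one. When all rollouts in a group share the same outcome, the advantages collapse to zero and the group contributes no learning signal.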
