APPO: Agentic Procedural Policy Optimization

Abstract

Recent advances in agentic reinforcement learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, which makes it difficult to identify which intermediate decisions influence the final outcome. In this work, we study agentic RL from two perspectives: where to branch, and how to assign credit after branching. Our pilot analysis shows that influential decision points are distributed broadly throughout the generated sequence rather than concentrated at tool calls, and that token entropy alone does not reliably reflect a position's impact on the final outcome. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points within the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with the policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves over strong agentic RL baselines by nearly 4 points, while keeping tool calls efficient and behavior interpretable.
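
As a rough illustration of how a score that combines token uncertainty with a likelihood gain might filter out spurious high-entropy positions, consider the minimal sketch below. The abstract does not give the exact form of the Branching Score, so the product rule, the clipping of negative gains, the top-k selection, and all function names and toy numbers here are assumptions made for illustration only, not the APPO definition.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_scores(entropies: Sequence[float], gains: Sequence[float]) -> List[float]:
    """Hypothetical combination rule: entropy times the positive part of the gain.

    `gains` stands in for a per-position likelihood gain of the subsequent
    continuation; how that gain is measured is not specified in the abstract.
    """
    return [h * max(g, 0.0) for h, g in zip(entropies, gains)]


def select_branch_points(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring positions, returned in sequence order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy next-token distributions for four positions (made-up numbers).
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident token
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain, but no downstream gain
    [0.40, 0.30, 0.20, 0.10],  # uncertain and high downstream gain
    [0.70, 0.10, 0.10, 0.10],  # moderately uncertain
]
gains = [0.5, 0.0, 0.8, 0.3]  # assumed likelihood gains, one per position

entropies = [token_entropy(d) for d in dists]
scores = branching_scores(entropies, gains)
branch_points = select_branch_points(scores, k=2)
print(branch_points)
```

In this toy setup the position with the highest entropy (index 1) is not selected, because its assumed gain is zero; entropy-only selection would have branched there first.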