A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Abstract

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, which makes it difficult to evaluate the contribution of individual tool calls within multi-turn interactions. Existing approaches to this process-level credit assignment either depend on separate external process reward models, which incur additional cost, or rely on tree-based structural rollouts, which merely redistribute the outcome signal while constraining trajectory diversity. A promising alternative uses the per-turn change in the policy's predicted probability of the ground-truth answer, termed Information Gain (IG), as an intrinsic process signal that requires no external evaluator. However, prior work that uses IG signals within the RL training loop faces three systematic challenges: normalizing across turns with heterogeneous positional contexts can distort the relative standing of individual turns; accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth; and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A^2TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but redesigns how it is normalized, accumulated, and consumed: (i) turn-group normalization normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation divides the cumulative normalized IG by the square root of the number of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
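The three components named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything below is an assumption made for illustration rather than the authors' implementation: the function names, the z-score form of the normalization, the choice to accumulate over the current and later turns, the discount `gamma`, and the `tanh` mapping from normalized IG to clip range (with hypothetical parameters `base_eps` and `alpha`) are all hypothetical. The sketch also assumes that a higher normalized IG marks a more informative turn.

```python
import math
from statistics import mean, pstdev


def turn_group_normalize(ig, eps=1e-8):
    """Z-score IG within each turn index, across rollouts of one prompt.

    ig[i][t] is the information gain of turn t in rollout i; rollouts may
    have different lengths. The z-score form is an assumption.
    """
    max_turns = max(len(traj) for traj in ig)
    normalized = [[0.0] * len(traj) for traj in ig]
    for t in range(max_turns):
        # Peers are the rollouts that actually reached turn t.
        peers = [i for i, traj in enumerate(ig) if len(traj) > t]
        values = [ig[i][t] for i in peers]
        mu, sigma = mean(values), pstdev(values)
        for i in peers:
            normalized[i][t] = (ig[i][t] - mu) / (sigma + eps)
    return normalized


def rescaled_discounted_advantage(z, gamma=0.9):
    """Discounted sum of normalized IG divided by sqrt(number of terms).

    Assumption: turn t accumulates its own and all later turns' values.
    """
    advantages = []
    for traj in z:
        adv = [0.0] * len(traj)
        running = 0.0
        for t in reversed(range(len(traj))):
            running = traj[t] + gamma * running
            n_terms = len(traj) - t  # how many terms were accumulated
            adv[t] = running / math.sqrt(n_terms)
        advantages.append(adv)
    return advantages


def adaptive_clip_range(z_t, base_eps=0.2, alpha=0.5):
    """Widen the clip range for high normalized IG, narrow it for low.

    The tanh mapping is hypothetical; with 0 <= alpha < 1 the range
    stays strictly positive.
    """
    return base_eps * (1.0 + alpha * math.tanh(z_t))


# Toy group: three rollouts of one prompt with 3, 2, and 3 turns.
ig = [[0.10, 0.30, 0.05], [0.20, 0.10], [0.00, 0.50, 0.25]]
z = turn_group_normalize(ig)
adv = rescaled_discounted_advantage(z)
clip = [[adaptive_clip_range(v) for v in traj] for traj in z]
```

In this sketch the last turn of every rollout accumulates a single term, so its advantage equals its normalized IG, while earlier turns are scaled down by the square root of their term count. The per-turn `clip` values would replace the single fixed clipping constant of a PPO-style clipped objective.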
