Listwise Policy Optimization: Group-Based RLVR as Target Projection on the LLM Response Simplex

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for post-training large language models (LLMs) to incentivize reasoning capacity. Among existing recipes, group-based policy gradient methods are prevalent: they sample a group of responses per prompt and update the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via a first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO), which performs the target projection explicitly. LPO first makes the implicit target explicit by restricting the proximal RL objective to the response simplex, and then projects the policy onto that target via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective, with projection gradients that are bounded, zero-sum, and self-correcting, and (ii) flexibility in the choice of divergence through the decoupled projection step, with each divergence bringing distinct structural properties. Across diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
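
The two-step structure described above (build a target on the response simplex, then project the policy onto it) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the proximal objective is assumed to be KL-regularized with temperature `beta`, so its maximizer over the group has the closed form q* ∝ p_old · exp(r / beta); the projection is assumed to minimize the forward KL divergence KL(q* || p); and the policy restricted to the group is modeled as a softmax over per-response log-probabilities. The function names `listwise_target` and `projection_gradient` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listwise_target(old_logps, rewards, beta=1.0):
    """Assumed target: maximizer of <q, r> - beta * KL(q || p_old) over the
    simplex of the sampled responses, i.e. q* proportional to p_old * exp(r / beta)."""
    return softmax([lp + r / beta for lp, r in zip(old_logps, rewards)])

def projection_gradient(cur_logps, target):
    """Gradient of KL(target || p) with respect to the group logits,
    where p = softmax(cur_logps). It equals p - target."""
    p = softmax(cur_logps)
    return [pi - qi for pi, qi in zip(p, target)]

# Hypothetical group of four responses to one prompt, with binary verifier rewards.
old_logps = [-2.0, -1.5, -3.0, -2.5]
rewards = [1.0, 0.0, 1.0, 0.0]

target = listwise_target(old_logps, rewards, beta=0.5)
grad = projection_gradient(old_logps, target)

# Logits that already match the target give a zero gradient.
grad_at_target = projection_gradient([math.log(q) for q in target], target)
```

Under these assumptions the gradient with respect to the group logits is the difference of two probability vectors, so each entry lies in [-1, 1], the entries sum to zero, and the gradient vanishes once the policy matches the target. This is one way to read the "bounded, zero-sum, and self-correcting" properties; other divergences would give different gradients.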