Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a central paradigm for improving reasoning and code generation in large language models, and GRPO-style training is widely adopted for its simplicity and effectiveness. However, an important design choice remains underexplored: how token-level policy gradient terms are aggregated within each sampled group. Standard GRPO uses sequence aggregation, while recent work has advocated token aggregation as a better alternative. We show that these two rules induce different optimization biases: token aggregation introduces sign-length coupling, while sequence aggregation implicitly downweights longer responses through sequence-level equal weighting. To address this tension, we propose Balanced Aggregation (BA), a simple drop-in replacement that computes token-level means separately within the positive and negative subsets and then combines them with sequence-count-based weights. Experiments with Qwen2.5-Math-7B and Qwen3-1.7B on DAPO-17k and Polaris, evaluated on six reasoning and coding benchmarks, show that BA consistently improves training stability and final performance over standard token and sequence aggregation. Our analysis further shows that the relative effectiveness of token and sequence aggregation is largely governed by response-length variation and the positive-negative length gap, highlighting aggregation as a critical design dimension in GRPO-style RLVR.
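The three aggregation rules contrasted above can be illustrated with a minimal sketch. It is an interpretation of the abstract's one-sentence description, not the paper's reference implementation. The function names, the use of plain Python lists of already advantage-weighted per-token terms, the handling of zero-advantage responses (skipped), and the exact weights (each subset's share of the signed sequences) are all assumptions made for illustration.

```python
from typing import Sequence


def _pooled_token_mean(responses: Sequence[Sequence[float]]) -> float:
    """Mean over all tokens pooled across the given responses (0.0 if none)."""
    total_tokens = sum(len(r) for r in responses)
    if total_tokens == 0:
        return 0.0
    return sum(sum(r) for r in responses) / total_tokens


def token_aggregation(terms: Sequence[Sequence[float]]) -> float:
    """Every token in the group gets the same weight, so longer responses
    (and whichever sign they carry) contribute more in total."""
    return _pooled_token_mean(terms)


def sequence_aggregation(terms: Sequence[Sequence[float]]) -> float:
    """Every response gets the same weight, so each token of a longer
    response is implicitly downweighted."""
    non_empty = [r for r in terms if len(r) > 0]
    if not non_empty:
        return 0.0
    return sum(sum(r) / len(r) for r in non_empty) / len(non_empty)


def balanced_aggregation(
    terms: Sequence[Sequence[float]], advantages: Sequence[float]
) -> float:
    """Token-level mean inside the positive and negative subsets separately,
    then combined with sequence-count-based weights.

    Assumption: responses with zero advantage are skipped, and the weights
    are n_pos / (n_pos + n_neg) and n_neg / (n_pos + n_neg).
    """
    pos = [r for r, a in zip(terms, advantages) if a > 0]
    neg = [r for r, a in zip(terms, advantages) if a < 0]
    n_signed = len(pos) + len(neg)
    if n_signed == 0:
        return 0.0
    return (len(pos) / n_signed) * _pooled_token_mean(pos) + (
        len(neg) / n_signed
    ) * _pooled_token_mean(neg)


# Toy group (hypothetical numbers): two positive responses of different
# lengths and two longer negative responses.
advantages = [1.0, 1.0, -1.0, -1.0]
terms = [
    [2.0, 2.0],   # short positive response
    [1.0] * 6,    # longer positive response
    [-1.0] * 8,   # long negative response
    [-1.0] * 8,   # long negative response
]

print(token_aggregation(terms))                 # -0.25
print(sequence_aggregation(terms))              # 0.25
print(balanced_aggregation(terms, advantages))  # 0.125
```

In this toy group the token rule returns a negative value because the negative responses hold most of the tokens, which is one way to picture sign-length coupling. The sequence rule gives each token of the six-token positive response a third of the weight of each token of the two-token one. The balanced variant pools tokens within each sign and weights the two sides by their sequence counts.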
