DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

Abstract

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
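
To make the contrast between the three scalarization strategies concrete, here is a minimal Python sketch. The two baselines follow the usual group-relative normalization, (reward - group mean) / group standard deviation. The abstract does not state DVAO's actual weighting rule, so `variance_adaptive_combination` is an illustrative assumption only: it sets each objective's weight proportional to its within-group reward standard deviation. All function names and the `EPS` constant are hypothetical.

```python
from statistics import fmean, pstdev

EPS = 1e-8  # hypothetical constant; guards against zero variance in a group


def group_normalize(values):
    """Group-relative advantage: (x - mean) / (std + eps) over one rollout group."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def reward_combination(rewards, weights):
    """Baseline 1: weighted sum of rewards per rollout, then one normalization."""
    n = len(rewards[0])
    combined = [sum(w * r[i] for w, r in zip(weights, rewards)) for i in range(n)]
    return group_normalize(combined)


def advantage_combination(rewards, weights):
    """Baseline 2: normalize each objective separately, then a fixed weighted sum."""
    per_objective = [group_normalize(r) for r in rewards]
    n = len(rewards[0])
    return [sum(w * a[i] for w, a in zip(weights, per_objective)) for i in range(n)]


def variance_adaptive_combination(rewards):
    """ASSUMPTION, not the paper's formula: weight each objective by its
    within-group reward standard deviation, normalized to sum to 1."""
    sigmas = [pstdev(r) for r in rewards]
    total = sum(sigmas)
    k = len(rewards)
    # Fall back to uniform weights when no objective varies within the group.
    weights = [s / total for s in sigmas] if total > 0 else [1.0 / k] * k
    return advantage_combination(rewards, weights), weights


if __name__ == "__main__":
    # Two objectives over a group of four rollouts: the first varies,
    # the second is constant and so carries no group-relative signal.
    rewards = [[1.0, 0.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
    advantages, weights = variance_adaptive_combination(rewards)
    print(weights)
    print(advantages)
```

Under this assumed rule the weights are non-negative and sum to 1, so the combined advantage is a convex combination of per-objective advantages whose root-mean-square is at most 1 each. By the triangle inequality the combined root-mean-square is therefore also at most 1. This illustrates how a bounded-magnitude property can arise, but it is not a restatement of the paper's proof.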