Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

January 27, 2026
Authors: Kishan Panaganti, Zhenwen Liang, Wenhao Yu, Haitao Mi, Dong Yu
cs.AI

Abstract

Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model's performance.
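
As a concrete reading of the Prompt-GDRO controller described above, the sketch below shows a generic EMA-debiased multiplicative-weights sampler over difficulty groups. This is an illustrative Python sketch under our own assumptions (the per-group loss is taken to be 1 - pass@k from recent batches, and the names `eta` and `ema_beta` are hypothetical hyperparameters), not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): a multiplicative-weights
# sampler over difficulty groups with an EMA-debiased per-group loss estimate.
# Hard groups (high loss, i.e. low pass@k) are upweighted for future sampling.
import numpy as np

class PromptGroupSampler:
    def __init__(self, num_groups: int, eta: float = 0.1, ema_beta: float = 0.9):
        self.log_weights = np.zeros(num_groups)  # multiplicative weights, kept in log space
        self.ema_loss = np.zeros(num_groups)     # EMA of each group's observed loss
        self.ema_count = np.zeros(num_groups)    # update counts, used for bias correction
        self.eta = eta                           # step size of the exponentiated-gradient update
        self.ema_beta = ema_beta                 # EMA decay factor

    def group_probs(self) -> np.ndarray:
        """Softmax over log-weights: the current sampling distribution over groups."""
        w = self.log_weights - self.log_weights.max()
        p = np.exp(w)
        return p / p.sum()

    def sample_group(self, rng: np.random.Generator) -> int:
        """Draw the next difficulty group to sample a prompt from."""
        return int(rng.choice(len(self.log_weights), p=self.group_probs()))

    def update(self, group: int, observed_loss: float) -> None:
        """Record a loss (e.g. 1 - pass@k) for a group and upweight it if it stays hard."""
        b = self.ema_beta
        self.ema_loss[group] = b * self.ema_loss[group] + (1.0 - b) * observed_loss
        self.ema_count[group] += 1
        # Bias-corrected EMA, so rarely visited groups are not underestimated early on.
        debiased = self.ema_loss[group] / (1.0 - b ** self.ema_count[group])
        # Exponentiated-gradient step: persistently hard groups gain sampling mass.
        self.log_weights[group] += self.eta * debiased
```

In a training loop, one would sample a group, draw a prompt from it, run rollouts, score pass@k, and feed 1 - pass@k back into `update`.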
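The square-root allocation that motivates Rollout-GDRO can likewise be sketched under a fixed mean budget. The Bernoulli variance proxy p(1 - p) and the largest-remainder rounding below are our own assumptions, meant only to illustrate a compute-neutral reallocation that gives more rollouts to groups with higher outcome variance; the paper's shadow-price controller is not reproduced here.

```python
# Illustrative sketch only: square-root rollout allocation across difficulty
# groups under a fixed total (hence fixed mean) rollout budget. The variance
# proxy p * (1 - p) for a group with empirical pass rate p is an assumption.
import numpy as np

def allocate_rollouts(pass_rates: np.ndarray, mean_budget: int) -> np.ndarray:
    """Return integer rollout counts per group summing to mean_budget * num_groups.

    Groups whose outcomes are most uncertain (pass rate near 0.5) get the most
    rollouts, following a square-root rule n_g proportional to sqrt(proxy_g)."""
    proxy = pass_rates * (1.0 - pass_rates)       # Bernoulli variance proxy per group
    scores = np.sqrt(np.maximum(proxy, 1e-8))     # square-root allocation scores
    total = mean_budget * len(pass_rates)         # fixed total budget (compute-neutral)
    raw = total * scores / scores.sum()           # ideal fractional allocation
    counts = np.floor(raw).astype(int)
    # Largest-remainder rounding so the integer counts sum exactly to the budget.
    leftover = int(total - counts.sum())
    order = np.argsort(-(raw - np.floor(raw)))
    counts[order[:leftover]] += 1
    return counts
```

For example, with three groups whose pass rates are 0.05, 0.5, and 0.95, the middle group receives the largest share of the rollouts; a practical version would also enforce a minimum number of rollouts per group so that nearly solved or nearly impossible groups are not starved entirely.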