
Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning

January 27, 2026
Authors: Kishan Panaganti, Zhenwen Liang, Wenhao Yu, Haitao Mi, Dong Yu
cs.AI

Abstract

Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers, together with a variance-proxy analysis that motivates a square-root-optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model's performance.
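To make the two controllers described in the abstract concrete, below is a minimal Python sketch of (a) a multiplicative-weights group sampler with an EMA-debiased loss estimate in the spirit of Prompt-GDRO, and (b) a square-root rollout allocation under a fixed mean budget in the spirit of Rollout-GDRO. All names, hyperparameters, and update rules here are illustrative assumptions based only on the abstract, not the authors' implementation; the square-root rule is derived by minimizing a variance proxy sum_g q_g * v_g / n_g under the compute-neutral constraint sum_g q_g * n_g = budget.

```python
import numpy as np


class PromptGDROSampler:
    """Illustrative multiplicative-weights (Hedge) sampler over difficulty groups,
    with an EMA-debiased per-group loss so rarely sampled groups are not judged
    by stale statistics (assumed interpretation of Prompt-GDRO)."""

    def __init__(self, n_groups: int, lr: float = 0.1, ema_beta: float = 0.9):
        self.log_w = np.zeros(n_groups)     # log-weights over difficulty groups
        self.ema_loss = np.zeros(n_groups)  # EMA of per-group loss, e.g. 1 - pass@k
        self.lr, self.beta = lr, ema_beta

    def probs(self) -> np.ndarray:
        w = np.exp(self.log_w - self.log_w.max())  # stabilized softmax over log-weights
        return w / w.sum()

    def sample_group(self, rng: np.random.Generator) -> int:
        return int(rng.choice(len(self.log_w), p=self.probs()))

    def update(self, group: int, loss: float) -> None:
        # Debias the observed loss with an EMA before the multiplicative update.
        self.ema_loss[group] = self.beta * self.ema_loss[group] + (1 - self.beta) * loss
        # Hedge step: persistently hard groups (high loss) gain sampling weight.
        self.log_w[group] += self.lr * self.ema_loss[group]


def sqrt_rollout_allocation(variance_proxy, group_freq, mean_budget: float, n_min: int = 1):
    """Square-root rule: minimizing sum_g q_g * v_g / n_g subject to
    sum_g q_g * n_g = mean_budget gives n_g proportional to sqrt(v_g)."""
    v = np.asarray(variance_proxy, dtype=float)  # per-group variance proxy, e.g. p * (1 - p)
    q = np.asarray(group_freq, dtype=float)      # per-group prompt frequencies
    raw = np.sqrt(np.maximum(v, 1e-12))
    n = raw * mean_budget / (q @ raw)            # rescale to meet the mean rollout budget
    return np.maximum(np.rint(n).astype(int), n_min)  # rounding may perturb the budget slightly


# Toy usage with 4 hypothetical difficulty groups; loss = 1 - pass@8 of the sampled batch.
rng = np.random.default_rng(0)
sampler = PromptGDROSampler(n_groups=4)
g = sampler.sample_group(rng)
sampler.update(g, loss=0.7)
rollouts = sqrt_rollout_allocation(
    variance_proxy=[0.05, 0.15, 0.25, 0.20],  # e.g. Bernoulli pass-rate variances per group
    group_freq=[0.4, 0.3, 0.2, 0.1],
    mean_budget=8,                            # average rollouts per prompt, as in GRPO
)
```

In this reading, harder groups receive both more sampling probability (through the Hedge weights) and more rollouts per prompt (through the square-root rule), while the rescaling step keeps the average rollout count fixed, matching the compute-neutral constraint stated in the abstract.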