
Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

February 5, 2026
作者: Shyam Sundhar Ramesh, Xiaotong Ji, Matthieu Zimmer, Sangwoong Yoon, Zhiyong Wang, Haitham Bou Ammar, Aurelien Lucchi, Ilija Bogunovic
cs.AI

Abstract

RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
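The two mechanisms described above can be illustrated with a minimal sketch. This is not the paper's implementation: the multiplicative-weights update rule, the learning rate, and the remainder-based batch allocation are all assumptions chosen to make the idea concrete. Component (i) upweights whichever task currently has the lowest reward, approximating a worst-task (max-min) objective; component (ii) allocates prompts in each batch proportionally to those weights, so each task's share of the policy gradient matches its adapted weight.

```python
import math
import random

def update_task_weights(weights, task_rewards, lr=1.0):
    # Multiplicative-weights update (an assumed rule, not the paper's exact one):
    # tasks with lower average reward receive exponentially more weight,
    # pushing optimization toward the worst-performing task.
    new = [w * math.exp(-lr * r) for w, r in zip(weights, task_rewards)]
    total = sum(new)
    return [w / total for w in new]

def ratio_preserving_sample(task_prompts, weights, batch_size, rng=random):
    # Allocate per-task prompt counts proportionally to the adapted weights,
    # handing leftover slots to the largest fractional remainders, so the
    # realized batch composition tracks the weights as closely as possible.
    counts = [int(batch_size * w) for w in weights]
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: batch_size * weights[i] - counts[i],
                          reverse=True)
    for i in by_remainder[: batch_size - sum(counts)]:
        counts[i] += 1
    batch = []
    for prompts, n in zip(task_prompts, counts):
        batch.extend(rng.choices(prompts, k=n))
    return batch
```

For example, with three tasks whose current rewards are (0.9, 0.2, 0.5), the update shifts the most weight onto the second (weakest) task, and a batch of 8 prompts is then split roughly in that proportion.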