Multi-Task GRPO: Reliable LLM Reasoning Across Tasks
February 5, 2026
作者: Shyam Sundhar Ramesh, Xiaotong Ji, Matthieu Zimmer, Sangwoong Yoon, Zhiyong Wang, Haitham Bou Ammar, Aurelien Lucchi, Ilija Bogunovic
cs.AI
Abstract
RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16–28% and 6% absolute improvement in worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Notably, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
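To make the two mechanisms concrete, here is a minimal sketch of one plausible interpretation: task weights updated by an exponentiated-gradient rule that upweights the worst-performing task, and an integer batch apportionment (largest-remainder method) that keeps per-task sample counts proportional to those weights. The update rule, learning rate, and function names are illustrative assumptions, not the paper's exact MT-GRPO algorithm.

```python
import math

def update_task_weights(weights, task_accuracies, lr=1.0):
    """Exponentiated-gradient style update that shifts mass toward
    worse-performing tasks, approximating worst-task optimization.
    NOTE: illustrative rule only, not MT-GRPO's exact update."""
    # Lower accuracy -> larger loss (1 - acc) -> larger weight afterward.
    logits = [math.log(w) + lr * (1.0 - acc)
              for w, acc in zip(weights, task_accuracies)]
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ratio_preserving_counts(weights, batch_size):
    """Apportion an integer batch across tasks so per-task prompt counts
    track the weights as closely as possible (largest-remainder method),
    so that task-wise gradient contributions reflect the weights."""
    ideal = [w * batch_size for w in weights]
    counts = [int(x) for x in ideal]
    remainders = [x - c for x, c in zip(ideal, counts)]
    # Hand leftover slots to the tasks with the largest fractional parts.
    leftover = batch_size - sum(counts)
    for i in sorted(range(len(weights)), key=lambda i: -remainders[i])[:leftover]:
        counts[i] += 1
    return counts

# Three tasks, task 2 currently worst; it receives the most samples.
weights = update_task_weights([1/3, 1/3, 1/3], [0.9, 0.5, 0.7])
print(ratio_preserving_counts(weights, 12))  # → [3, 5, 4]
```

The largest-remainder step matters because naive rounding of `weight * batch_size` can over- or under-fill the batch, silently distorting the effective task ratios the weight update was meant to enforce.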