

From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

December 2, 2025
Authors: Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang
cs.AI

Abstract

Reinforcement learning has emerged as a key paradigm for post-training large language models, effectively boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting how much better or worse it performs than expected, thereby yielding both positive and negative training signals. However, existing methods mix the two signals indiscriminately, especially in the early stages of training, which can lead to ambiguous guidance and limited gains. To address this issue, we propose **CAPO** (**C**urriculum **A**dvantage **P**olicy **O**ptimization), an adaptive curriculum mechanism based on advantage signals. The mechanism first bootstraps imitation learning with positive-advantage-only samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization in complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements on mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
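
As a rough illustration of the curriculum described in the abstract, the sketch below shows one plausible way an advantage-based curriculum could be wired into a GRPO-style update: negative advantages are masked out during an initial warm-up phase, and both positive and negative signals are admitted afterwards. The function names, the fixed `warmup_steps` threshold, and the hard zero-masking rule are illustrative assumptions, not the paper's actual (adaptive) schedule.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def curriculum_advantages(rewards, step, warmup_steps=500):
    """Curriculum over advantage signals (illustrative, not the paper's exact rule).

    Early phase: keep only positive advantages, so updates imitate
    better-than-expected samples. Later phase: pass both positive and
    negative advantages through, adding discriminative pressure.
    """
    adv = group_advantages(rewards)
    if step < warmup_steps:
        adv = np.where(adv > 0, adv, 0.0)  # mask negative signals during warm-up
    return adv

# Example: one group of 4 rollout rewards for the same prompt
print(curriculum_advantages([1.0, 0.0, 0.0, 1.0], step=100))   # negatives masked
print(curriculum_advantages([1.0, 0.0, 0.0, 1.0], step=1000))  # full signal
```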