From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks
December 2, 2025
Authors: Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang
cs.AI
Abstract
Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting how much better or worse it performs than expected, thereby yielding both positive and negative training signals. However, existing methods mix the two signals indiscriminately, even in the early stages of training, which can lead to ambiguous guidance and limited gains. To address this issue, we propose **CAPO** (**C**urriculum **A**dvantage **P**olicy **O**ptimization), an adaptive curriculum mechanism based on advantage signals. The mechanism first bootstraps imitation learning with positive-advantage-only samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization in complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, CAPO consistently achieves stable and significant improvements on mathematical reasoning tasks and generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
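To make the two-stage idea concrete, here is a minimal, hypothetical sketch of curriculum gating over advantage signals. The function name `curriculum_advantage_mask`, the step-based `warmup_steps` switch, and the GRPO-style group normalization in the usage example are illustrative assumptions for exposition only; the paper's actual adaptive switching criterion and implementation details are not given here.

```python
import numpy as np

def curriculum_advantage_mask(advantages, step, warmup_steps=1000):
    """Two-stage curriculum over advantage signals (illustrative sketch).

    Stage 1 (step < warmup_steps): keep only positive advantages, so the
    policy update reduces to imitation of better-than-expected samples.
    Stage 2 (step >= warmup_steps): also pass negative advantages through,
    adding the discriminative (penalizing) signal.
    """
    advantages = np.asarray(advantages, dtype=np.float32)
    if step < warmup_steps:
        return np.where(advantages > 0.0, advantages, 0.0)  # positive-only stage
    return advantages  # full positive + negative signal


# Usage with GRPO-style group-normalized advantages (hypothetical rewards)
rewards = np.array([1.0, 0.0, 1.0, 0.0])                   # per-rollout rewards
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # baseline-subtracted
print(curriculum_advantage_mask(adv, step=100))   # early: negatives zeroed out
print(curriculum_advantage_mask(adv, step=5000))  # later: full signal retained
```

The gated advantages would then feed into whatever surrogate objective the underlying optimizer (GRPO, PPO, RLOO, Reinforce++) already uses, which is what would keep such a curriculum orthogonal to the choice of optimization method.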