TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning
December 15, 2025
Authors: Shenzhi Yang, Guangcheng Zhu, Xing Zheng, Yingfan Ma, Zhongqi Chen, Bowen Song, Weiqiang Wang, Junbo Zhao, Gang Chen, Haobo Wang
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, but it suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, for example through entropy or majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that uses a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm, TraPO, which identifies reliable unlabeled samples by matching their learning trajectories to those of labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, with 4K labeled and 12K unlabeled samples, TraPO even outperforms the fully supervised model trained on the full 45K labeled samples on all benchmarks, while using only 10% of the labeled data. The code is available at https://github.com/ShenzhiYang2000/TRAPO.
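To make the trajectory-matching idea in the abstract more concrete, the minimal sketch below selects "reliable" unlabeled samples whose learning trajectories (here assumed to be per-training-step reward or confidence vectors) are close in cosine similarity to those of labeled samples. Everything in the sketch, including the names `select_reliable`, `cosine_sim`, and the similarity threshold, is an illustrative assumption; the abstract does not specify how TraPO represents or compares learning trajectories.

```python
# Hypothetical sketch: pick unlabeled samples whose learning trajectories
# resemble those of labeled samples. Not the authors' implementation.
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two trajectory vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def select_reliable(unlabeled_traj: np.ndarray,
                    labeled_traj: np.ndarray,
                    threshold: float = 0.9) -> list:
    """Return indices of unlabeled samples whose trajectory is close
    (cosine similarity >= threshold) to at least one labeled trajectory."""
    keep = []
    for i, u in enumerate(unlabeled_traj):
        best = max(cosine_sim(u, l) for l in labeled_traj)
        if best >= threshold:
            keep.append(i)
    return keep


# Toy usage: 3 unlabeled and 2 labeled samples, trajectories of length 5.
rng = np.random.default_rng(0)
unlabeled = rng.random((3, 5))
labeled = rng.random((2, 5))
print(select_reliable(unlabeled, labeled, threshold=0.8))
```

Only the selected unlabeled samples would then receive consistency-based (e.g., majority-vote) pseudo-rewards during RL training, while the labeled set continues to provide verifiable rewards, matching the semi-supervised setup described above.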