Can LLMs Learn to Reason Robustly under Noisy Supervision?
April 5, 2026
Authors: Shenzhi Yang, Guangcheng Zhu, Bowen Song, Sharon Li, Haobo Wang, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) is effective at training reasoning models but relies on abundant, perfectly labeled data; its vulnerability to the noisy labels that inevitably arise when expert annotation is scarce remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy-label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label influences training only if the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which merely reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: accuracy on clean and noisy samples increases similarly in early training, even though noisy samples begin to lag behind in later stages. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively replaces potentially noisy labels with majority-voted answers when two conditions hold: the majority answer's rollout pass rate shows a positive slope, and the majority answer remains consistent across recent updates. This enables gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, with average gains of 3.6%-3.9% on in-distribution benchmarks and 3.3%-4.6% on out-of-distribution evaluations.
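The two-condition refinement rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`majority_answer`, `should_refine`), the least-squares slope estimate, and the "history" inputs are all assumptions about how such a check might look, based only on the abstract's description (positive trend in the majority answer's rollout pass rate, plus a stable majority answer across updates).

```python
# Hypothetical sketch of the Online Label Refinement (OLR) decision rule
# from the abstract. Names, inputs, and the slope estimator are assumptions,
# not the paper's actual implementation.
from collections import Counter


def majority_answer(rollout_answers):
    """Most frequent final answer among the current policy's rollouts."""
    return Counter(rollout_answers).most_common(1)[0][0]


def slope(values):
    """Least-squares slope of a sequence over steps 0..n-1."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den if den else 0.0


def should_refine(pass_rate_history, majority_history):
    """Replace a potentially noisy label with the majority-voted answer
    only when both abstract conditions hold:
    (1) the majority answer's rollout pass rate trends upward, and
    (2) the majority answer has stayed the same across recent updates."""
    rising = slope(pass_rate_history) > 0
    stable = len(set(majority_history)) == 1
    return rising and stable
```

For example, `should_refine([0.2, 0.35, 0.5], ["42", "42", "42"])` would trigger refinement, while a flat or declining pass rate, or a majority answer that flips between updates, would not.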