

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

February 24, 2026
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while they improve Pass@1 accuracy through sharpened sampling, they simultaneously narrow the model's reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches -- whether data-filtering methods that select prompts by difficulty, or advantage normalization schemes -- treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors (incorrect reasoning paths that the RL process has spuriously reinforced) to persist and monopolize probability mass, ultimately suppressing valid exploratory trajectories. To address this, we propose the Asymmetric Confidence-aware Error Penalty (ACE). ACE introduces a per-rollout confidence shift metric, c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), to dynamically modulate negative advantages. Theoretically, we demonstrate that ACE's gradient can be decomposed into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength. We conduct extensive experiments fine-tuning Qwen2.5-Math-7B, Qwen3-8B-Base, and Llama-3.1-8B-Instruct on the DAPO-Math-17K dataset using GRPO and DAPO within the VERL framework. Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
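The abstract defines a per-rollout confidence shift c_i = log(π_θ(y_i|x) / π_ref(y_i|x)) and states that ACE uses it to dynamically modulate negative advantages. The following is a minimal sketch of that idea, not the paper's actual implementation: the specific modulation function (a linear scaling by 1 + β·c_i on overconfident errors) and the parameter `beta` are assumptions for illustration, since the abstract does not specify the functional form.

```python
import numpy as np

def confidence_shift(logp_theta, logp_ref):
    """Per-rollout confidence shift c_i = log pi_theta(y_i|x) - log pi_ref(y_i|x),
    computed from sequence log-probabilities under the current policy (theta)
    and the frozen reference policy (ref)."""
    return np.asarray(logp_theta, dtype=float) - np.asarray(logp_ref, dtype=float)

def ace_modulated_advantages(advantages, c, beta=1.0):
    """Hypothetical sketch of asymmetric penalty modulation: amplify the penalty
    on incorrect rollouts (negative advantage) in proportion to how overconfident
    they are (c_i > 0). Correct rollouts and under-confident errors are untouched.
    The linear form 1 + beta * c_i is an assumption, not the paper's formula."""
    adv = np.asarray(advantages, dtype=float)
    c = np.asarray(c, dtype=float)
    scale = np.where((adv < 0) & (c > 0), 1.0 + beta * c, 1.0)
    return adv * scale

# Three rollouts: an overconfident error, an under-confident error, a correct one.
adv = np.array([-1.0, -1.0, 1.0])
c = confidence_shift([-2.0, -5.0, -1.2], [-2.5, -4.7, -2.0])  # [0.5, -0.3, 0.8]
modulated = ace_modulated_advantages(adv, c, beta=1.0)
print(modulated)  # only the overconfident error's penalty is amplified
```

The asymmetry is the point: errors that the policy has grown more confident in than the reference (c_i > 0) receive a stronger correction, while the gradient on everything else is left as standard GRPO/DAPO would compute it.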