Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

February 24, 2026
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while they improve Pass@1 accuracy through sharpened sampling, they simultaneously narrow the model's reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches -- whether data-filtering methods that select prompts by difficulty, or advantage normalization schemes -- treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors (incorrect reasoning paths that the RL process has spuriously reinforced) to persist and monopolize probability mass, ultimately suppressing valid exploratory trajectories. To address this, we propose the Asymmetric Confidence-aware Error Penalty (ACE). ACE introduces a per-rollout confidence shift metric, c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), to dynamically modulate negative advantages. Theoretically, we demonstrate that ACE's gradient can be decomposed into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength. We conduct extensive experiments fine-tuning Qwen2.5-Math-7B, Qwen3-8B-Base, and Llama-3.1-8B-Instruct on the DAPO-Math-17K dataset using GRPO and DAPO within the VERL framework. Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
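The core mechanism described above can be sketched in a few lines: compute the per-rollout confidence shift c_i as the difference of sequence log-probabilities under the current policy and the reference policy, then scale only the negative advantages of overconfident rollouts. This is a minimal illustrative sketch; the scaling function 1 + beta * max(c_i, 0) and the hyperparameter beta are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def confidence_shift(logp_theta, logp_ref):
    """Per-rollout confidence shift c_i = log pi_theta(y_i|x) - log pi_ref(y_i|x).

    Both arguments are the summed sequence log-probabilities of each rollout
    under the current policy (pi_theta) and the frozen reference policy (pi_ref).
    c_i > 0 means the RL process has made the rollout more likely than it was
    under the reference model.
    """
    return np.asarray(logp_theta, dtype=float) - np.asarray(logp_ref, dtype=float)

def ace_advantages(advantages, logp_theta, logp_ref, beta=0.5):
    """Asymmetrically rescale negative advantages by confidence shift.

    Overconfident errors (advantage < 0 and c_i > 0) get a stronger penalty;
    underconfident errors and all correct rollouts are left untouched.
    The multiplier 1 + beta * max(c_i, 0) is an illustrative choice.
    """
    adv = np.asarray(advantages, dtype=float)
    c = confidence_shift(logp_theta, logp_ref)
    scale = 1.0 + beta * np.maximum(c, 0.0)   # only amplifies when c_i > 0
    return np.where(adv < 0.0, adv * scale, adv)
```

For example, two incorrect rollouts with the same group-normalized advantage of -1.0 receive different penalties once confidence is accounted for: the one the policy has spuriously sharpened (c_i > 0) is pushed down harder, while the already-suppressed one (c_i < 0) keeps its original penalty.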