Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for improving reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: they improve Pass@1 accuracy by sharpening the sampling distribution, but at the same time they narrow the model's reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches, whether data-filtering methods that select prompts by difficulty or advantage normalization schemes, treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors (incorrect reasoning paths that the RL process has spuriously reinforced) to persist and monopolize probability mass, ultimately suppressing valid exploratory trajectories. To address this, we propose the Asymmetric Confidence-aware Error Penalty (ACE). ACE introduces a per-rollout confidence shift metric, c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), and uses it to dynamically modulate negative advantages. Theoretically, we show that ACE's gradient decomposes into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength. We conduct extensive experiments fine-tuning Qwen2.5-Math-7B, Qwen3-8B-Base, and Llama-3.1-8B-Instruct on the DAPO-Math-17K dataset using GRPO and DAPO within the VERL framework. Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and both benchmarks.
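The abstract defines the confidence shift c_i but does not state the exact rule used to modulate negative advantages. The following is a minimal illustrative sketch under stated assumptions, not the authors' implementation: it assumes GRPO-style group-normalized advantages and a hypothetical weight of 1 + alpha * max(c_i, 0) applied only to negative advantages, so that only errors the policy has become more confident in than the reference (c_i > 0) are penalized more strongly. The function names, the `alpha` parameter, and the ReLU-style weight are all assumptions made for illustration.

```python
import math
from typing import List


def confidence_shift(logp_theta: float, logp_ref: float) -> float:
    """c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), from sequence log-probabilities."""
    return logp_theta - logp_ref


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: rewards normalized by the group mean and std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def ace_advantages(
    rewards: List[float],
    logp_theta: List[float],
    logp_ref: List[float],
    alpha: float = 1.0,  # hypothetical penalty strength, not from the source
) -> List[float]:
    """Scale negative advantages by an assumed weight 1 + alpha * max(c_i, 0)."""
    modulated = []
    for adv, lt, lr in zip(group_advantages(rewards), logp_theta, logp_ref):
        if adv < 0:
            c = confidence_shift(lt, lr)
            # Assumption: only overconfident errors (c > 0) get a stronger penalty.
            adv *= 1.0 + alpha * max(c, 0.0)
        modulated.append(adv)  # positive advantages pass through unchanged
    return modulated


# One prompt, four rollouts: two correct (reward 1), two incorrect (reward 0).
rewards = [1.0, 0.0, 0.0, 1.0]
logp_theta = [-10.0, -5.0, -12.0, -9.0]  # log-probs under the current policy
logp_ref = [-10.0, -8.0, -11.0, -9.0]    # log-probs under the reference policy
base = group_advantages(rewards)
ace = ace_advantages(rewards, logp_theta, logp_ref)
```

In this toy group, the second rollout is an error whose log-probability rose by 3 relative to the reference, so under the assumed weight its negative advantage is scaled by 4. The third rollout is an error the policy has already moved away from (c_i = -1) and keeps its standard penalty, as do the correct rollouts.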
