DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Abstract

Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities increase or decrease during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive-side and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting the sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose DelTA, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.
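
The discriminator view described above can be illustrated with a small toy computation. The sketch below is not the DelTA method and not the paper's implementation; it only reproduces the standard sequence-level picture under stated assumptions: each token has a hypothetical 3-dimensional token-gradient vector, every token inherits its response's scalar advantage, and the update direction is the advantage-weighted sum of token gradients. The names `weighted_centroid`, `update_direction`, `token_grads`, and `advantages` are illustrative and do not come from the source.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_centroid(vectors, weights, dim):
    """Weighted mean of vectors; returns the zero vector if there is no weight."""
    total = sum(weights)
    if total == 0:
        return [0.0] * dim
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]


def update_direction(token_grads, advantages):
    """Build the update direction from positive- and negative-side centroids.

    token_grads[i] is the list of token-gradient vectors of response i and
    advantages[i] is that response's scalar advantage (assumed setup).
    """
    dim = len(token_grads[0][0])
    pos_vecs, pos_w, neg_vecs, neg_w = [], [], [], []
    for grads, adv in zip(token_grads, advantages):
        for g in grads:
            if adv > 0:
                pos_vecs.append(g)
                pos_w.append(adv)
            elif adv < 0:
                neg_vecs.append(g)
                neg_w.append(-adv)
    c_pos = weighted_centroid(pos_vecs, pos_w, dim)
    c_neg = weighted_centroid(neg_vecs, neg_w, dim)
    z_pos, z_neg = sum(pos_w), sum(neg_w)
    # Equals sum_i A_i * sum_t g_{i,t}, rewritten as a difference of centroids.
    w = [z_pos * a - z_neg * b for a, b in zip(c_pos, c_neg)]
    return w, c_pos, c_neg


# Toy data: axis 0 is a shared "formatting" direction, axis 1 appears only in
# the high-reward response, axis 2 only in the low-reward response.
fmt, good, bad = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
token_grads = [[fmt, fmt, fmt, good], [fmt, fmt, fmt, bad]]
advantages = [1.0, -1.0]

w, c_pos, c_neg = update_direction(token_grads, advantages)

# To first order, a small step along w changes a token's log-probability in
# proportion to dot(g, w): positive means increase, negative means decrease.
scores = {"fmt": dot(fmt, w), "good": dot(good, w), "bad": dot(bad, w)}
print(c_pos, c_neg, w, scores)
```

In this toy setting both centroids place most of their mass on the shared formatting axis (0.75 versus 0.25 on the side-specific axis), which illustrates how high-frequency shared tokens can dominate centroid construction. The sign of `dot(g, w)` plays the role of the linear discriminator's decision for each token.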