DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Abstract

Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose DelTA, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.
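
The discriminator view described above can be illustrated with a small numerical sketch. This is not the DelTA method and does not estimate any token coefficients; it only shows, on hand-built toy vectors, how a sequence-level policy-gradient direction decomposes into positive- and negative-side advantage-weighted centroids and acts as a linear score over token-gradient vectors. All names (`token_grads`, `shared`, `specific`, and so on) and the toy values are assumptions made for illustration. The sketch also assumes that each token inherits the advantage of its response and that the effect of a small step on a token's log-probability is taken to first order.

```python
import numpy as np

# Hypothetical toy setup: each row is the gradient of one token's
# log-probability with respect to the parameters (4 parameters here).
shared = np.array([5.0, 0.0, 0.0, 0.0])    # large component common to all tokens
specific = np.array([0.0, 1.0, 0.0, 0.0])  # small component that differs by side

# Three tokens from a high-reward response, three from a low-reward one.
token_grads = np.stack([shared + specific] * 3 + [shared - specific] * 3)
# Assumption: every token carries the advantage of the response it belongs to.
advantages = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# Sequence-level policy-gradient direction: u = sum_i A_i * g_i.
u = (advantages[:, None] * token_grads).sum(axis=0)

# Advantage-weighted centroids of the positive and negative sides.
pos, neg = advantages > 0, advantages < 0
w_pos = advantages[pos].sum()
w_neg = -advantages[neg].sum()
c_pos = (advantages[pos][:, None] * token_grads[pos]).sum(axis=0) / w_pos
c_neg = (-advantages[neg][:, None] * token_grads[neg]).sum(axis=0) / w_neg

# Linear discriminator: to first order, a step along u changes the
# log-probability of token j in proportion to g_j . u.
scores = token_grads @ u

# Cosine similarity between the two centroids.
cos = float(c_pos @ c_neg / (np.linalg.norm(c_pos) * np.linalg.norm(c_neg)))

print("update direction:", u)
print("token scores:", scores)
print("centroid cosine similarity:", round(cos, 3))
```

In this toy case the update direction equals `w_pos * c_pos - w_neg * c_neg`, and the sign of each token's score indicates whether its probability would rise or fall to first order. The two centroids are nearly parallel because the shared component dominates both of them, while the direction that separates the sides is comparatively small, which is the kind of situation the abstract describes as diluting discriminative directions.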