Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become the leading paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard RLVR algorithms suffer from a well-documented pathology: while they improve Pass@1 accuracy through sharpened sampling, they simultaneously narrow the model's reasoning boundary and reduce generation diversity. We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches, whether data-filtering methods that select prompts by difficulty or advantage normalization schemes, treat all incorrect rollouts within a group identically. We show that this uniformity allows overconfident errors (incorrect reasoning paths that the RL process has spuriously reinforced) to persist and monopolize probability mass, ultimately suppressing valid exploratory trajectories. To address this, we propose the Asymmetric Confidence-aware Error Penalty (ACE). ACE introduces a per-rollout confidence shift metric, c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), to dynamically modulate negative advantages. Theoretically, we demonstrate that ACE's gradient can be decomposed into the gradient of a selective regularizer restricted to overconfident errors, plus a well-characterized residual that partially moderates the regularizer's strength. We conduct extensive experiments fine-tuning Qwen2.5-Math-7B, Qwen3-8B-Base, and Llama-3.1-8B-Instruct on the DAPO-Math-17K dataset using GRPO and DAPO within the VERL framework. Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
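
As a rough illustration of the confidence shift described above, the following Python sketch computes c_i from sequence log-probabilities and uses it to scale negative advantages. The abstract does not state the modulation function, so the form 1 + alpha * max(c_i, 0), the coefficient alpha, and all function names here are assumptions made for illustration, not the authors' implementation. The baseline is a GRPO-style group-normalized advantage.

```python
import math
from typing import List


def confidence_shift(logp_theta: float, logp_ref: float) -> float:
    """c_i = log(pi_theta(y_i|x) / pi_ref(y_i|x)), given sequence log-probs."""
    return logp_theta - logp_ref


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: (reward - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def ace_advantages(
    rewards: List[float],
    logp_theta: List[float],
    logp_ref: List[float],
    alpha: float = 1.0,  # hypothetical penalty strength, not from the paper
) -> List[float]:
    """Scale negative advantages by an ASSUMED factor 1 + alpha * max(c_i, 0).

    Positive advantages are left untouched (the asymmetry); errors whose
    confidence did not grow relative to the reference (c_i <= 0) keep the
    baseline penalty.
    """
    out = []
    for adv, lt, lr in zip(group_advantages(rewards), logp_theta, logp_ref):
        if adv < 0:
            c_i = confidence_shift(lt, lr)
            adv *= 1.0 + alpha * max(c_i, 0.0)
        out.append(adv)
    return out


# Toy group of four rollouts: one correct, three incorrect.
rewards = [1.0, 0.0, 0.0, 0.0]
logp_theta = [-10.0, -5.0, -12.0, -8.0]  # current policy log-probs
logp_ref = [-10.0, -9.0, -10.0, -8.0]    # reference policy log-probs
print(ace_advantages(rewards, logp_theta, logp_ref))
```

In this toy group, the incorrect rollout whose log-probability rose relative to the reference (c_i = 4) receives a penalty five times its baseline value, while the other two incorrect rollouts (c_i = -2 and c_i = 0) keep the baseline penalty.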
