ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning of Large Language Models (LLMs), but it usually limits generation diversity because positive rewards are over-incentivized. Methods such as Negative Sample Reinforcement (NSR) mitigate this by upweighting the penalty on negative samples, yet they may also suppress the semantic distributions shared by positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL), which decouples the similar semantic distributions of positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment and guides conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses the projection residuals to modulate negative gradients. This improves reasoning while preserving diversity, and ResRL outperforms strong baselines on average across twelve benchmarks spanning mathematics, code, agent tasks, and function calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.
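The projection step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`positive_subspace`, `residual_weights`, `reweight_negative_advantages`), the choice of the relative residual norm as the weight, and the toy 3-dimensional hidden states are all assumptions made for illustration. The abstract does not specify the exact weighting rule, the layer the hidden states come from, or how the rank is chosen.

```python
import numpy as np


def positive_subspace(pos_hidden, rank):
    """Orthonormal basis (d x rank) for the top-`rank` directions of the
    positive-token hidden states, given as an (n_pos x d) matrix."""
    # Rows are token representations, so the right-singular vectors
    # (rows of vt) span the feature space.
    _, _, vt = np.linalg.svd(pos_hidden, full_matrices=False)
    return vt[:rank].T


def residual_weights(neg_hidden, basis, eps=1e-8):
    """Relative projection residual per negative token, roughly in [0, 1].

    Near 0: the token lies inside the positive subspace (shared semantics).
    Near 1: the token is close to orthogonal to the positive subspace.
    """
    projected = neg_hidden @ basis @ basis.T
    residual_norm = np.linalg.norm(neg_hidden - projected, axis=1)
    total_norm = np.linalg.norm(neg_hidden, axis=1)
    return residual_norm / (total_norm + eps)


def reweight_negative_advantages(neg_advantages, weights):
    """Hypothetical modulation rule: scale each negative advantage by its
    residual weight, so tokens that resemble positive ones are penalized less."""
    return neg_advantages * weights


# Toy data (assumed): positive hidden states lie in the x-y plane of R^3.
pos_hidden = np.array([[1.0, 0.0, 0.0],
                       [0.0, 2.0, 0.0],
                       [1.0, 1.0, 0.0]])
neg_hidden = np.array([[1.0, 0.0, 0.0],   # inside the positive subspace
                       [0.0, 0.0, 1.0],   # orthogonal to it
                       [1.0, 0.0, 1.0]])  # partly shared

basis = positive_subspace(pos_hidden, rank=2)
weights = residual_weights(neg_hidden, basis)
scaled = reweight_negative_advantages(np.array([-1.0, -1.0, -1.0]), weights)
print(np.round(weights, 3))  # approximately [0, 1, 0.707]
print(np.round(scaled, 3))
```

In this toy setup, a negative token that lies entirely inside the positive subspace receives almost no penalty, while one orthogonal to it keeps its full negative advantage. That is the qualitative behavior the abstract attributes to residual-modulated negative gradients.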
