ReDit: Reward Dithering for Improved LLM Policy Optimization

Abstract

DeepSeek-R1 has successfully enhanced the reasoning capabilities of large language models (LLMs) through its rule-based reward system. Although this is a "perfect" reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomalies, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are provided continuously throughout training, enabling smoother gradient updates and faster convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% of the training steps, and it still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm that ReDit significantly mitigates these gradient issues. Moreover, theoretical analyses are provided to further validate these advantages.
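The core idea of dithering a discrete reward can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says "simple random noise", so the zero-mean Gaussian noise, the value `sigma=0.05`, and the helper names `group_advantages` and `dither_rewards` are assumptions made for illustration. The advantage is computed GRPO-style, by normalizing rewards within a group of sampled responses.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dither_rewards(rewards, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to each reward.

    Assumption: the noise distribution and sigma are illustrative choices,
    not values taken from the abstract.
    """
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    return rewards + rng.normal(0.0, sigma, size=rewards.shape)

rng = np.random.default_rng(0)

# A flat group: every sampled response got the same discrete reward (all wrong).
flat = np.zeros(8)

# Without dithering, every advantage is zero, so the group gives no gradient signal.
plain_adv = group_advantages(flat)

# With dithering, the rewards differ slightly, so the advantages are nonzero.
dithered_adv = group_advantages(dither_rewards(flat, rng=rng))

print("plain:   ", plain_adv)
print("dithered:", dithered_adv)
```

In this toy case the undithered group contributes nothing to the update, while the dithered group yields small random advantages, which is one way to read the abstract's claim that noise introduces stochasticity into flat reward regions. How the noise scale should be chosen is not addressed by this sketch.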
