ReDit: Reward Dithering for Improved LLM Policy Optimization

Abstract

DeepSeek-R1 has successfully enhanced the reasoning capabilities of large language models (LLMs) through its rule-based reward system. Although this is a "perfect" reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomalies, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are provided continuously throughout learning, enabling smoother gradient updates and faster convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit matches the performance of vanilla GRPO with only approximately 10% of the training steps, and it still shows a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm that ReDit significantly mitigates these gradient issues. Moreover, we provide theoretical analyses to further validate these advantages.
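
The core idea of dithering a discrete reward can be illustrated with a minimal sketch. The abstract does not state the noise distribution or its scale, so the zero-mean Gaussian noise, the value `sigma=0.05`, and the helper names `dither_rewards` and `group_advantages` below are assumptions made for illustration, not the authors' implementation. The advantage computation follows the usual GRPO-style group normalization (reward minus group mean, divided by group standard deviation). The example shows why a flat reward region is a problem: when every sampled completion in a group receives the same discrete reward, the normalized advantages are all zero and carry no learning signal, whereas the dithered rewards yield nonzero advantages.

```python
import numpy as np

def dither_rewards(rewards, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to discrete rewards.

    The Gaussian noise model and sigma are assumptions for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    return rewards + rng.normal(0.0, sigma, size=rewards.shape)

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rng = np.random.default_rng(0)

# A "flat" group: every sampled completion failed the rule-based check.
flat_group = [0.0, 0.0, 0.0, 0.0]

# Without dithering, all advantages are zero, so the policy gradient vanishes.
plain = group_advantages(flat_group)

# With dithering, the group has nonzero variance and nonzero advantages.
dithered = group_advantages(dither_rewards(flat_group, sigma=0.05, rng=rng))

print("plain advantages:   ", plain)
print("dithered advantages:", dithered)
```

In this sketch the noise scale is kept small relative to the gap between discrete reward levels (0 and 1), so that dithering perturbs ties without reordering correct and incorrect completions in a mixed group; how the scale should actually be chosen is not specified in the abstract.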
