ReDit: Reward Dithering for Improved LLM Policy Optimization
June 23, 2025
Authors: Chenxing Wei, Jiarui Yu, Ying Tiffany He, Hande Dong, Yao Shu, Fei Yu
cs.AI
Abstract
DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning
capabilities through its rule-based reward system. While this is a "perfect"
reward system that effectively mitigates reward hacking, such reward functions
are often discrete. Our experimental observations suggest that discrete rewards
can lead to gradient anomalies, unstable optimization, and slow convergence. To
address this issue, we propose ReDit (Reward Dithering), a method that dithers
the discrete reward signal by adding simple random noise. With this perturbed
reward, exploratory gradients are continuously provided throughout the learning
process, enabling smoother gradient updates and accelerating convergence. The
injected noise also introduces stochasticity into flat reward regions,
encouraging the model to explore novel policies and escape local optima.
Experiments across diverse tasks demonstrate the effectiveness and efficiency
of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO
with only approximately 10% of the training steps and, furthermore, still exhibits
a 4% performance improvement over vanilla GRPO when trained for a similar
duration. Visualizations confirm significant mitigation of gradient issues with
ReDit. Moreover, theoretical analyses are provided to further validate these
advantages.
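
To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of how dithered rewards could feed into GRPO-style group-normalized advantages. The Gaussian noise distribution, the scale `sigma`, and the helper names `dither_rewards` and `grpo_advantages` are assumptions for illustration only.

```python
import torch

def dither_rewards(rewards: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Perturb discrete rule-based rewards (e.g., 0/1) with zero-mean noise.

    Hypothetical sketch: the paper's exact noise distribution and scale are
    not specified here; small zero-mean Gaussian noise is one plausible choice.
    """
    return rewards + sigma * torch.randn_like(rewards)

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages in the GRPO style: normalize rewards within
    the group of completions sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 8 completions scored by a binary rule-based reward.
raw = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
dithered = dither_rewards(raw, sigma=0.05)
adv = grpo_advantages(dithered)

# In a "flat" group where every completion gets the same discrete reward,
# the group standard deviation is zero and the vanilla advantages degenerate;
# dithering breaks these ties and keeps a usable learning signal.
flat = torch.ones(8)
adv_flat = grpo_advantages(dither_rewards(flat, sigma=0.05))
```

Under these assumptions, the dithered rewards remain unbiased on average while providing non-degenerate, smoother advantage estimates in otherwise flat reward regions.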