

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

April 13, 2026
Authors: Yang Liu, Enxi Wang, Yufei Gao, Weixin Zhang, Bo Wang, Zhiyuan Zeng, Yikai Zhang, Yining Zheng, Xipeng Qiu
cs.AI

Abstract

Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.
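The mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation: the distance-threshold clustering below stands in for the density-based clustering (e.g., DBSCAN-style) the abstract mentions, and all function names, the threshold `eps`, and the penalty scale `alpha` are illustrative assumptions.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    # Plain Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(features, eps=0.5):
    """Greedy threshold clustering as a stand-in for density-based
    clustering: assign each vector to the first cluster whose
    representative point lies within eps, else open a new cluster."""
    reps, labels = [], []
    for f in features:
        for i, r in enumerate(reps):
            if euclidean(f, r) <= eps:
                labels.append(i)
                break
        else:
            reps.append(f)
            labels.append(len(reps) - 1)
    return labels

def shaped_rewards(base_rewards, error_features, alpha=0.1):
    """Penalize each failed rollout in proportion to the size of the
    error cluster it falls into, so frequently recurring error
    patterns are discouraged more heavily."""
    labels = cluster(error_features)
    sizes = defaultdict(int)
    for lab in labels:
        sizes[lab] += 1
    return [r - alpha * sizes[lab]
            for r, lab in zip(base_rewards, labels)]
```

Under this sketch, two rollouts whose stored representations land in the same (larger) cluster receive a bigger penalty than a rollout exhibiting a novel error, which is the intended pressure toward broader exploration.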