The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

April 13, 2026
作者: Yang Liu, Enxi Wang, Yufei Gao, Weixin Zhang, Bo Wang, Zhiyuan Zeng, Yikai Zhang, Yining Zheng, Xipeng Qiu
cs.AI

Abstract

Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.
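The abstract describes the shaping mechanism only at a high level. The sketch below illustrates one plausible reading of that loop: remember feature vectors of failed rollouts across iterations, cluster the memory with a density-based method (DBSCAN here), and penalize current failures in proportion to the prevalence of the error cluster they fall into. Every name, threshold, and hyperparameter (`shaped_rewards`, `penalty_scale`, `eps`, `min_samples`, the non-positive-reward failure criterion) is an assumption for illustration, not the paper's implementation.

```python
# Minimal sketch of memory-enhanced reward shaping as described in the
# abstract. All names and hyperparameters are illustrative assumptions,
# not the authors' implementation.
import numpy as np
from sklearn.cluster import DBSCAN

def shaped_rewards(base_rewards, rollout_features, memory, penalty_scale=0.1):
    """Penalize failed rollouts that fall into frequently recurring error clusters.

    base_rewards:     (n,) array of task rewards for the current batch of rollouts.
    rollout_features: (n, d) array of intermediate-representation features,
                      one vector per rollout (assumed stored by the trainer).
    memory:           persistent list of feature vectors from past failed rollouts.
    """
    failed = base_rewards <= 0.0               # failure criterion: assumed non-positive reward
    memory.extend(rollout_features[failed])    # remember failures across iterations

    if len(memory) < 5:                        # not enough history to cluster yet (assumed cutoff)
        return base_rewards

    # Density-based clustering over all remembered failures; label -1 is DBSCAN noise.
    feats = np.stack(memory)
    labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(feats)

    # Cluster prevalence: larger clusters correspond to more frequently
    # recurring error patterns and thus draw heavier penalties.
    sizes = {c: int(np.sum(labels == c)) for c in set(labels) if c != -1}

    # The current batch's failures are the most recent entries in `memory`,
    # so their cluster labels are the tail of `labels`, in batch order.
    shaped = base_rewards.astype(float).copy()
    n_failed = int(failed.sum())
    current_labels = labels[-n_failed:] if n_failed > 0 else []
    for i, c in zip(np.where(failed)[0], current_labels):
        if c != -1:
            shaped[i] -= penalty_scale * sizes[c] / len(memory)
    return shaped
```

Under this reading, the penalty for a failed rollout scales with the fraction of remembered failures in its cluster, so a mistake the policy keeps repeating costs progressively more, while novel errors (DBSCAN noise points) are not penalized beyond the base reward.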