Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
January 13, 2026
Authors: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
cs.AI
Abstract
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
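The abstract only sketches the reweighting idea, so the following is a minimal Python sketch of one way the rollout-level objective could work. The function name `uniqueness_weighted_advantages`, the GRPO-style group baseline, and the exact inverse-cluster-size scaling are illustrative assumptions rather than the paper's formulation; strategy cluster labels are assumed to be produced upstream by an LLM-based judge.

```python
from collections import Counter
from typing import List

def uniqueness_weighted_advantages(
    rewards: List[float],      # per-rollout correctness rewards for one problem (e.g. 1.0 / 0.0)
    cluster_ids: List[int],    # high-level strategy cluster per rollout, assigned by an LLM judge
) -> List[float]:
    """Reweight rollout-level advantages so rare correct strategies count more.

    A minimal sketch under stated assumptions: the paper says advantages are
    reweighted inversely with cluster size; the centering and normalization
    used here are illustrative choices, not the authors' exact method.
    """
    n = len(rewards)
    # GRPO-style baseline: mean reward over the group's rollouts (assumption).
    baseline = sum(rewards) / n
    cluster_sizes = Counter(cluster_ids)

    advantages = []
    for r, c in zip(rewards, cluster_ids):
        adv = r - baseline
        if r > 0:
            # Correct rollout: scale its advantage up when its strategy cluster
            # is small (rare) and down when it is large (redundant).
            adv *= n / (len(cluster_sizes) * cluster_sizes[c])
        advantages.append(adv)
    return advantages


# Example: 4 rollouts, 3 correct; two correct rollouts share the same strategy.
# The lone correct strategy (cluster 1) gets a larger advantage than the
# redundant pair (cluster 0); the incorrect rollout is left unweighted.
print(uniqueness_weighted_advantages([1.0, 1.0, 1.0, 0.0], [0, 0, 1, 2]))
```

In this sketch only correct rollouts are rescaled, which matches the stated goal of rewarding correct-but-novel strategies over redundant ones while leaving the penalty for incorrect rollouts unchanged.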