
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

January 13, 2026
作者: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
cs.AI

Abstract

Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
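The reweighting step described above can be made concrete with a short sketch. The following Python snippet is a minimal illustration under stated assumptions, not the authors' implementation: it assumes binary correctness rewards, assumes the strategy-cluster labels have already been produced upstream by the LLM-based judge, uses a plain 1/(cluster size) weighting consistent with "reweights policy advantages inversely with cluster size", and adds a GRPO-style group-relative baseline as an assumption; all function and variable names are hypothetical.

```python
import numpy as np

def uniqueness_weighted_advantages(rewards, cluster_ids, eps=1e-8):
    """Reweight per-rollout rewards inversely with strategy-cluster size,
    then compute group-relative advantages over the rollouts of one problem.

    rewards     -- correctness rewards for G rollouts of the same problem
    cluster_ids -- one strategy-cluster label per rollout (assumed to come
                   from an LLM judge that ignores superficial variations)
    """
    rewards = np.asarray(rewards, dtype=float)
    # Count how many rollouts share each high-level strategy cluster.
    sizes = {c: cluster_ids.count(c) for c in set(cluster_ids)}
    # Scale each reward by 1 / cluster size: a correct rollout that is the
    # only member of its cluster keeps its full reward, while one of k
    # redundant rollouts keeps only 1/k of it.
    shaped = np.array([r / sizes[c] for r, c in zip(rewards, cluster_ids)])
    # GRPO-style group baseline (an assumption here, not from the abstract):
    # center and normalize the shaped rewards within the rollout group.
    return (shaped - shaped.mean()) / (shaped.std() + eps)

# Five rollouts of one problem: three correct rollouts use strategy "A",
# one incorrect rollout uses "B", one correct rollout uses the rare "C".
rewards = [1.0, 1.0, 1.0, 0.0, 1.0]
clusters = ["A", "A", "A", "B", "C"]
print(uniqueness_weighted_advantages(rewards, clusters))
```

On this toy group the lone "C" rollout receives the largest advantage, the three redundant "A" rollouts split theirs, and the incorrect "B" rollout is pushed below the baseline, which is exactly the incentive the abstract describes: correct but novel strategies are rewarded over redundant ones.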