Diversity-Enhanced Reasoning for Subjective Questions
July 27, 2025
Authors: Yumeng Wang, Zhiyuan Fan, Jiayu Liu, Yi R. Fung
cs.AI
Abstract
Large reasoning models (LRMs) with long chain-of-thought (CoT) capabilities
have shown strong performance on objective tasks, such as math reasoning and
coding. However, their effectiveness on subjective questions that may have
different responses from different perspectives is still limited by a tendency
towards homogeneous reasoning, introduced by the reliance on a single ground
truth in supervised fine-tuning and verifiable reward in reinforcement
learning. Motivated by the finding that increasing role perspectives
consistently improves performance, we propose MultiRole-R1, a
diversity-enhanced framework with multiple role perspectives, to improve the
accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an
unsupervised data construction pipeline that generates reasoning chains that
incorporate diverse role perspectives. We further employ reinforcement learning
via Group Relative Policy Optimization (GRPO) with reward shaping, by taking
diversity as a reward signal in addition to the verifiable reward. With
specially designed reward functions, we successfully promote perspective
diversity and lexical diversity, uncovering a positive relation between
reasoning diversity and accuracy. Our experiments on six benchmarks demonstrate
MultiRole-R1's effectiveness and generalizability in enhancing both subjective
and objective reasoning, showcasing the potential of diversity-enhanced
training in LRMs.
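
To make the reward-shaping idea concrete, here is a minimal sketch of combining a verifiable (correctness) reward with a diversity reward and computing GRPO-style group-relative advantages. The function names, the correctness check, the type-token-ratio diversity proxy, and the weight `alpha` are illustrative assumptions, not the paper's actual reward design.

```python
# Sketch: shaped reward = verifiable reward + alpha * diversity reward,
# followed by GRPO-style within-group standardization of rewards.
# All names and heuristics here are hypothetical stand-ins.

from statistics import mean, pstdev


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary correctness reward against a reference answer (toy check)."""
    return 1.0 if reference.strip().lower() in completion.lower() else 0.0


def lexical_diversity_reward(completion: str) -> float:
    """Type-token ratio as a simple proxy for lexical diversity."""
    tokens = completion.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shaped_rewards(completions: list[str], reference: str, alpha: float = 0.2) -> list[float]:
    """Weighted sum of verifiable and diversity rewards for one sampled group."""
    return [
        verifiable_reward(c, reference) + alpha * lexical_diversity_reward(c)
        for c in completions
    ]


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "From an economist's perspective, the answer is B because incentives matter.",
        "The answer is B. The answer is B. The answer is B.",
        "Taking a sociologist's perspective, option C fits the scenario better.",
    ]
    rewards = shaped_rewards(group, reference="B")
    print(group_relative_advantages(rewards))
```

Under this sketch, a correct but repetitive completion earns the verifiable reward yet little diversity bonus, so the shaped advantage favors completions that are both correct and lexically (or perspectivally) varied, which is the behavior the abstract attributes to diversity-enhanced training.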