Diversity-Enhanced Reasoning for Subjective Questions
July 27, 2025
Authors: Yumeng Wang, Zhiyuan Fan, Jiayu Liu, Yi R. Fung
cs.AI
Abstract
Large reasoning models (LRMs) with long chain-of-thought (CoT) capabilities
have shown strong performance on objective tasks, such as math reasoning and
coding. However, their effectiveness on subjective questions that may have
different responses from different perspectives is still limited by a tendency
towards homogeneous reasoning, introduced by the reliance on a single ground
truth in supervised fine-tuning and verifiable reward in reinforcement
learning. Motivated by the finding that increasing role perspectives
consistently improves performance, we propose MultiRole-R1, a
diversity-enhanced framework with multiple role perspectives, to improve the
accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an
unsupervised data construction pipeline that generates reasoning chains that
incorporate diverse role perspectives. We further employ reinforcement learning
via Group Relative Policy Optimization (GRPO) with reward shaping, by taking
diversity as a reward signal in addition to the verifiable reward. With
specially designed reward functions, we successfully promote perspective
diversity and lexical diversity, uncovering a positive relation between
reasoning diversity and accuracy. Our experiments on six benchmarks demonstrate
MultiRole-R1's effectiveness and generalizability in enhancing both subjective
and objective reasoning, showcasing the potential of diversity-enhanced
training in LRMs.
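
To make the reward-shaping idea concrete, here is a minimal sketch of combining a verifiable (correctness) reward with a diversity reward and computing GRPO-style group-relative advantages. The function names, the correctness check, the type-token-ratio diversity proxy, and the weight `alpha` are illustrative assumptions, not the paper's actual reward design.

```python
# Sketch: shaped reward = verifiable reward + alpha * diversity reward,
# followed by GRPO-style within-group standardization of rewards.
# All names and heuristics here are hypothetical stand-ins.

from statistics import mean, pstdev


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary correctness reward against a reference answer (toy check)."""
    return 1.0 if reference.strip().lower() in completion.lower() else 0.0


def lexical_diversity_reward(completion: str) -> float:
    """Type-token ratio as a simple proxy for lexical diversity."""
    tokens = completion.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shaped_rewards(completions: list[str], reference: str, alpha: float = 0.2) -> list[float]:
    """Weighted sum of verifiable and diversity rewards for one sampled group."""
    return [
        verifiable_reward(c, reference) + alpha * lexical_diversity_reward(c)
        for c in completions
    ]


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "From an economist's perspective, the answer is B because incentives matter.",
        "The answer is B. The answer is B. The answer is B.",
        "Taking a sociologist's perspective, option C fits the scenario better.",
    ]
    rewards = shaped_rewards(group, reference="B")
    print(group_relative_advantages(rewards))
```

Under this sketch, a correct but repetitive completion earns the verifiable reward yet little diversity bonus, so the shaped advantage favors completions that are both correct and lexically (or perspectivally) varied, which is the behavior the abstract attributes to diversity-enhanced training.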