Diversity-Enhanced Reasoning for Subjective Questions
July 27, 2025
Authors: Yumeng Wang, Zhiyuan Fan, Jiayu Liu, Yi R. Fung
cs.AI
Abstract
Large reasoning models (LRMs) with long chain-of-thought (CoT) capabilities
have shown strong performance on objective tasks, such as math reasoning and
coding. However, their effectiveness on subjective questions that may have
different responses from different perspectives is still limited by a tendency
towards homogeneous reasoning, introduced by the reliance on a single ground
truth in supervised fine-tuning and on verifiable rewards in reinforcement
learning. Motivated by the finding that increasing the number of role perspectives
consistently improves performance, we propose MultiRole-R1, a
diversity-enhanced framework with multiple role perspectives, to improve the
accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an
unsupervised data construction pipeline that generates reasoning chains
incorporating diverse role perspectives. We further employ reinforcement learning
via Group Relative Policy Optimization (GRPO) with reward shaping, taking
diversity as a reward signal in addition to the verifiable reward. With
specially designed reward functions, we successfully promote perspective
diversity and lexical diversity, uncovering a positive relationship between
reasoning diversity and accuracy. Experiments on six benchmarks demonstrate
MultiRole-R1's effectiveness and generalizability in enhancing both subjective
and objective reasoning, showcasing the potential of diversity-enhanced
training in LRMs.
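
As a rough illustration of the reward-shaping idea described above, the sketch below combines a binary verifiable reward with a lexical-diversity bonus and then normalizes rewards within a sampled group, GRPO-style. The distinct-2 measure, the additive weighting, and all function names here are assumptions for illustration only; the paper's actual reward functions (including the perspective-diversity reward) are not reproduced in this abstract.

```python
import math

def distinct_n(tokens, n=2):
    """Fraction of unique n-grams among all n-grams: a common lexical-diversity proxy (assumption)."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)

def shaped_reward(response_tokens, is_correct, diversity_weight=0.2):
    """Verifiable (correctness) reward plus a weighted diversity bonus (hypothetical weighting)."""
    verifiable = 1.0 if is_correct else 0.0
    diversity = distinct_n(response_tokens, n=2)
    return verifiable + diversity_weight * diversity

def grpo_advantages(rewards):
    """Group-relative advantages: normalize rewards across rollouts sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mean) / std for r in rewards]

# Toy example: a group of 4 sampled responses for one prompt.
group = [
    (["the", "answer", "is", "42"], True),
    (["answer", "answer", "answer", "42"], True),
    (["it", "might", "be", "7"], False),
    (["from", "another", "view", "it", "is", "42"], True),
]
rewards = [shaped_reward(tokens, correct) for tokens, correct in group]
advantages = grpo_advantages(rewards)
print(rewards, advantages)
```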