Diversity-Enhanced Reasoning for Subjective Questions

Abstract

Large reasoning models (LRMs) with long chain-of-thought (CoT) capabilities have shown strong performance on objective tasks such as mathematical reasoning and coding. However, their effectiveness on subjective questions, which may have different answers from different perspectives, is still limited by a tendency towards homogeneous reasoning. This tendency stems from the reliance on a single ground truth in supervised fine-tuning and on verifiable rewards in reinforcement learning. Motivated by the finding that increasing the number of role perspectives consistently improves performance, we propose MultiRole-R1, a diversity-enhanced framework that uses multiple role perspectives to improve both accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an unsupervised data construction pipeline that generates reasoning chains incorporating diverse role perspectives. We further employ reinforcement learning via Group Relative Policy Optimization (GRPO) with reward shaping, taking diversity as a reward signal in addition to the verifiable reward. With specially designed reward functions, we successfully promote perspective diversity and lexical diversity, uncovering a positive correlation between reasoning diversity and accuracy. Experiments on six benchmarks demonstrate the effectiveness and generalizability of MultiRole-R1 in enhancing both subjective and objective reasoning, showcasing the potential of diversity-enhanced training in LRMs.
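To make the reward-shaping idea concrete, the following is a minimal sketch of adding a diversity bonus to a verifiable reward and then computing group-relative advantages in the style of GRPO. It is an illustration under stated assumptions, not the reward functions used by MultiRole-R1, which the abstract does not specify. The distinct-n ratio serves as a stand-in for lexical diversity, and the names `distinct_n`, `shaped_reward`, `group_relative_advantages`, and the `weight` parameter are all hypothetical. Perspective diversity is not modeled here.

```python
from statistics import mean, pstdev


def distinct_n(text: str, n: int = 2) -> float:
    """Fraction of unique n-grams among all n-grams in the text.

    Assumption: used here only as a simple proxy for lexical diversity.
    """
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def shaped_reward(response: str, is_correct: bool, weight: float = 0.5) -> float:
    """Verifiable reward (1 if correct, else 0) plus a weighted diversity bonus.

    Assumption: the additive form and the default weight are illustrative.
    """
    return float(is_correct) + weight * distinct_n(response)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group.

    Assumption: population standard deviation; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One sampled group of responses to the same question (made-up examples).
group = [
    ("a teacher would say yes a parent would say no", True),
    ("yes yes yes yes yes yes", True),
    ("no no no no", False),
]
rewards = [shaped_reward(text, ok) for text, ok in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, the first two responses are both marked correct, but the first receives the larger shaped reward because its bigrams repeat less, so it also receives the larger advantage within the group.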
