Diversity-Enhanced Reasoning for Subjective Questions

Abstract

Large reasoning models (LRMs) with long chain-of-thought (CoT) capabilities have shown strong performance on objective tasks such as mathematical reasoning and coding. However, their effectiveness on subjective questions, which may have different answers from different perspectives, is still limited by a tendency toward homogeneous reasoning. This tendency is introduced by the reliance on a single ground truth in supervised fine-tuning and on verifiable rewards in reinforcement learning. Motivated by the finding that increasing the number of role perspectives consistently improves performance, we propose MultiRole-R1, a diversity-enhanced framework with multiple role perspectives that improves both accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an unsupervised data construction pipeline that generates reasoning chains incorporating diverse role perspectives. We further employ reinforcement learning via Group Relative Policy Optimization (GRPO) with reward shaping, taking diversity as a reward signal in addition to the verifiable reward. With specially designed reward functions, we promote perspective diversity and lexical diversity, uncovering a positive relation between reasoning diversity and accuracy. Experiments on six benchmarks demonstrate the effectiveness and generalizability of MultiRole-R1 in enhancing both subjective and objective reasoning, showcasing the potential of diversity-enhanced training in LRMs.
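
The reward shaping described above can be illustrated with a minimal sketch. It is not the MultiRole-R1 implementation: the abstract does not specify the reward functions, so everything here is an assumption made for illustration. The helper names (`distinct_n`, `shaped_rewards`, `group_relative_advantages`), the use of a distinct-n-gram ratio as the lexical-diversity proxy, the additive combination, and the `weight` value are all hypothetical. Perspective diversity is not modeled. The only part taken from GRPO itself is that rewards are normalized within a group of responses sampled for the same prompt.

```python
from statistics import mean, pstdev


def distinct_n(text: str, n: int = 2) -> float:
    """Fraction of unique n-grams in a response (assumed lexical-diversity proxy)."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def shaped_rewards(responses, correct, weight=0.5):
    """Verifiable reward (1.0 if correct, else 0.0) plus a weighted diversity bonus.

    The additive form and the weight are assumptions, not values from the paper.
    """
    return [float(ok) + weight * distinct_n(r) for r, ok in zip(responses, correct)]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group by the group mean and std.

    Population standard deviation is used here as a simplifying choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two equally correct responses to the same prompt; the second is less repetitive.
group = ["a b a b a b", "a b c d e f"]
rewards = shaped_rewards(group, correct=[True, True])
advantages = group_relative_advantages(rewards)
```

In this toy group both responses earn the same verifiable reward, so the diversity bonus alone decides which one receives the positive advantage. That is the intended effect of using diversity as an additional reward signal.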