Diversity-Enhanced Reasoning for Subjective Questions

Abstract

Large reasoning models (LRMs) with long chain-of-thought (CoT) capabilities have shown strong performance on objective tasks such as mathematical reasoning and coding. However, their effectiveness on subjective questions, which may have different answers from different perspectives, is still limited by a tendency towards homogeneous reasoning. This tendency is introduced by the reliance on a single ground truth in supervised fine-tuning and on verifiable rewards in reinforcement learning. Motivated by the finding that increasing the number of role perspectives consistently improves performance, we propose MultiRole-R1, a diversity-enhanced framework with multiple role perspectives, to improve both accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an unsupervised data construction pipeline that generates reasoning chains incorporating diverse role perspectives. We further employ reinforcement learning via Group Relative Policy Optimization (GRPO) with reward shaping, taking diversity as a reward signal in addition to the verifiable reward. With specially designed reward functions, we promote both perspective diversity and lexical diversity, and uncover a positive relation between reasoning diversity and accuracy. Experiments on six benchmarks demonstrate MultiRole-R1's effectiveness and generalizability in enhancing both subjective and objective reasoning, showcasing the potential of diversity-enhanced training in LRMs.
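To make the idea of a diversity-shaped reward concrete, the following is a minimal illustrative sketch, not the reward functions used in MultiRole-R1. It assumes a distinct-n ratio as a stand-in for lexical diversity, a hypothetical mixing weight `weight`, and the standard group-relative normalization used by GRPO (reward minus group mean, divided by group standard deviation). The names `distinct_n`, `shaped_rewards`, and `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def distinct_n(text: str, n: int = 2) -> float:
    """Fraction of unique n-grams among all n-grams (assumed diversity proxy)."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def shaped_rewards(responses, correct, weight=0.5):
    """Verifiable reward (1.0 if correct, else 0.0) plus a weighted diversity bonus.

    `weight` is an assumed hyperparameter, not a value from the paper.
    """
    return [float(c) + weight * distinct_n(r) for r, c in zip(responses, correct)]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two sampled responses to the same prompt, both judged correct.
group = ["a b a b a b", "the teacher and the parent disagree here"]
rewards = shaped_rewards(group, correct=[True, True])
advantages = group_relative_advantages(rewards)
```

Under these assumptions, two equally correct responses receive different advantages, so the policy update favors the less repetitive one.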