Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
July 23, 2025
Authors: Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, Lijun Wu
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a
powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing
research has predominantly concentrated on isolated reasoning domains such as
mathematical problem-solving, coding tasks, or logical reasoning. However,
real-world reasoning scenarios inherently demand an integrated application of
multiple cognitive skills. Despite this, the interplay among these reasoning
skills under reinforcement learning remains poorly understood. To bridge this
gap, we present a systematic investigation of multi-domain reasoning within the
RLVR framework, explicitly focusing on three primary domains: mathematical
reasoning, code generation, and logical puzzle solving. We conduct a
comprehensive study comprising four key components: (1) Leveraging the GRPO
algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the
models' in-domain improvements and cross-domain generalization capabilities
when trained on single-domain datasets. (2) Additionally, we examine the
intricate interactions, including mutual enhancements and conflicts, that
emerge during combined cross-domain training. (3) To further understand the
influence of supervised fine-tuning (SFT) on RL, we also analyze and compare
performance differences between base
and instruct models under identical RL configurations. (4) Furthermore, we
delve into critical RL training details, systematically exploring the impacts
of curriculum learning strategies, variations in reward design, and
language-specific factors. Through extensive experiments, our results offer
significant insights into the dynamics governing domain interactions, revealing
key factors influencing both specialized and generalizable reasoning
performance. These findings provide valuable guidance for optimizing RL
methodologies to foster comprehensive, multi-domain reasoning capabilities in
LLMs.
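
To make the RLVR and GRPO setup described in the abstract concrete, the sketch below illustrates the two core ingredients: a rule-based verifiable reward (here, exact match on a boxed math answer) and GRPO-style group-normalized advantages. This is a minimal illustration under assumptions, not the paper's implementation; the reward rule and the helper names are hypothetical.

```python
# Minimal sketch (assumption, not the paper's code): a rule-based verifiable
# reward plus GRPO-style group-normalized advantages for one sampled group.
from typing import List
import re
import statistics


def math_reward(response: str, gold_answer: str) -> float:
    """Return 1.0 if the boxed final answer matches the reference, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-normalized advantages: (r_i - mean) / std over one group of samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]


# Example: four sampled responses to the same prompt, scored by the verifiable reward.
group = ["... \\boxed{42}", "... \\boxed{41}", "... \\boxed{42}", "no final answer"]
rewards = [math_reward(r, "42") for r in group]
print(grpo_advantages(rewards))  # correct samples get positive advantages
```

The same interface extends to the other two domains studied in the paper: for code generation the reward would come from unit-test execution, and for logic puzzles from checking the predicted assignment against the known solution.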