Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
July 23, 2025
Authors: Yu Li, Zhuoshi Pan, Honglin Lin, Mengyuan Sun, Conghui He, Lijun Wu
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a
powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing
research has predominantly concentrated on isolated reasoning domains such as
mathematical problem-solving, coding tasks, or logical reasoning. However,
real-world reasoning scenarios inherently demand an integrated application of
multiple cognitive skills. Despite this, the interplay among these reasoning
skills under reinforcement learning remains poorly understood. To bridge this
gap, we present a systematic investigation of multi-domain reasoning within the
RLVR framework, explicitly focusing on three primary domains: mathematical
reasoning, code generation, and logical puzzle solving. We conduct a
comprehensive study comprising four key components: (1) Leveraging the GRPO
algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the
models' in-domain improvements and cross-domain generalization capabilities
when trained on single-domain datasets. (2) Additionally, we examine the
intricate interactions, including mutual enhancements and conflicts, that
emerge during combined cross-domain training. (3) To further understand the influence
of SFT on RL, we also analyze and compare performance differences between base
and instruct models under identical RL configurations. (4) Furthermore, we
delve into critical RL training details, systematically exploring the impacts
of curriculum learning strategies, variations in reward design, and
language-specific factors. Through extensive experiments, our results offer
significant insights into the dynamics governing domain interactions, revealing
key factors influencing both specialized and generalizable reasoning
performance. These findings provide valuable guidance for optimizing RL
methodologies to foster comprehensive, multi-domain reasoning capabilities in
LLMs.
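
To make the RLVR setup the abstract refers to more concrete, the following is a minimal sketch (not the authors' code) of how a GRPO-style, group-relative advantage can be computed from a verifiable reward: sample several completions per prompt, score each with an automatic check (here, an exact-match check on an extracted math answer, chosen as an illustrative stand-in for the paper's reward designs), and normalize rewards within the group. The function and field names (`group_relative_advantages`, `verifiable_math_reward`, `extracted_answer`) are hypothetical.

```python
# Minimal sketch of a GRPO-style advantage from a verifiable reward.
# Assumptions: answers are already extracted from completions, and the
# reward is a binary exact-match check (illustrative, not the paper's exact setup).
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Completion:
    text: str
    extracted_answer: str  # assumed to be parsed from the model output


def verifiable_math_reward(completion: Completion, gold_answer: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    return 1.0 if completion.extracted_answer.strip() == gold_answer.strip() else 0.0


def group_relative_advantages(
    completions: List[Completion],
    gold_answer: str,
    reward_fn: Callable[[Completion, str], float] = verifiable_math_reward,
    eps: float = 1e-6,
) -> List[float]:
    """Normalize per-completion rewards within the sampled group (GRPO-style)."""
    rewards = [reward_fn(c, gold_answer) for c in completions]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        Completion(text="... so the answer is 42", extracted_answer="42"),
        Completion(text="... therefore 41", extracted_answer="41"),
        Completion(text="... the result is 42", extracted_answer="42"),
    ]
    # Correct completions receive positive advantages, incorrect ones negative.
    print(group_relative_advantages(group, gold_answer="42"))
```

Swapping `verifiable_math_reward` for a unit-test pass rate (code generation) or a constraint checker (logic puzzles) yields the per-domain reward signals the study compares, without changing the group-normalization step.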