Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
June 17, 2025
Authors: Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
cs.AI
Abstract
Reinforcement learning (RL) has emerged as a promising approach to improve
large language model (LLM) reasoning, yet most open efforts focus narrowly on
math and code, limiting our understanding of its broader applicability to
general reasoning. A key challenge lies in the lack of reliable, scalable RL
reward signals across diverse reasoning domains. We introduce Guru, a curated
RL reasoning corpus of 92K verifiable examples spanning six reasoning
domains--Math, Code, Science, Logic, Simulation, and Tabular--each built
through domain-specific reward design, deduplication, and filtering to ensure
reliability and effectiveness for RL training. Based on Guru, we systematically
revisit established findings in RL for LLM reasoning and observe significant
variation across domains. For example, while prior work suggests that RL
primarily elicits existing knowledge from pretrained models, our results reveal
a more nuanced pattern: domains frequently seen during pretraining (Math, Code,
Science) easily benefit from cross-domain RL training, while domains with
limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain
training to achieve meaningful performance gains, suggesting that RL is likely
to facilitate genuine skill acquisition. Finally, we present Guru-7B and
Guru-32B, two models that achieve state-of-the-art performance among open
models RL-trained with publicly available data, outperforming best baselines by
7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We
also show that our models effectively improve the Pass@k performance of their
base models, particularly on complex tasks less likely to appear in pretraining
data. We release our data, models, and training and evaluation code to
facilitate general-purpose reasoning research at: https://github.com/LLM360/Reasoning360