Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
June 17, 2025
作者: Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
cs.AI
Abstract
Reinforcement learning (RL) has emerged as a promising approach to improve
large language model (LLM) reasoning, yet most open efforts focus narrowly on
math and code, limiting our understanding of its broader applicability to
general reasoning. A key challenge lies in the lack of reliable, scalable RL
reward signals across diverse reasoning domains. We introduce Guru, a curated
RL reasoning corpus of 92K verifiable examples spanning six reasoning
domains--Math, Code, Science, Logic, Simulation, and Tabular--each built
through domain-specific reward design, deduplication, and filtering to ensure
reliability and effectiveness for RL training. Based on Guru, we systematically
revisit established findings in RL for LLM reasoning and observe significant
variation across domains. For example, while prior work suggests that RL
primarily elicits existing knowledge from pretrained models, our results reveal
a more nuanced pattern: domains frequently seen during pretraining (Math, Code,
Science) easily benefit from cross-domain RL training, while domains with
limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain
training to achieve meaningful performance gains, suggesting that RL is likely
to facilitate genuine skill acquisition. Finally, we present Guru-7B and
Guru-32B, two models that achieve state-of-the-art performance among open
models RL-trained with publicly available data, outperforming the best baselines by
7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We
also show that our models effectively improve the Pass@k performance of their
base models, particularly on complex tasks less likely to appear in pretraining
data. We release data, models, and training and evaluation code to facilitate
general-purpose reasoning at: https://github.com/LLM360/Reasoning360
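The abstract emphasizes domain-specific, verifiable reward design as the key to reliable RL signals. As an illustration only (not Guru's actual reward code, which lives in the linked repository), a minimal rule-based reward for the Math domain could extract a model's final answer and compare it against the reference, yielding a binary score that can be used directly as an RL reward:

```python
import re

def extract_final_answer(response: str) -> str:
    """Take the last \\boxed{...} if present, else the last non-empty line."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        return boxed[-1]
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def _normalize(ans: str) -> str:
    """Light normalization: drop '$', surrounding spaces, trailing periods; lowercase."""
    return ans.replace("$", "").strip().rstrip(".").lower()

def math_reward(response: str, gold: str) -> float:
    """Binary verifiable reward (illustrative sketch, not Guru's implementation):
    1.0 if the extracted answer matches the reference exactly or numerically, else 0.0."""
    pred, ref = _normalize(extract_final_answer(response)), _normalize(gold)
    if pred == ref:
        return 1.0
    try:
        return 1.0 if abs(float(pred) - float(ref)) < 1e-9 else 0.0
    except ValueError:
        return 0.0
```

Rewards of this kind are programmatically checkable, which is what makes the examples "verifiable" and keeps the reward signal reliable and scalable across a 92K-example corpus.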
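For the Pass@k results mentioned above, one standard way to compute the metric is the unbiased estimator introduced with the Codex evaluation (Chen et al., 2021); whether Guru's evaluation uses exactly this estimator is an assumption here. The sketch below takes n generations per problem, of which c passed verification:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k sampled generations is correct),
    given n generations per problem with c verified correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so any k-subset contains a correct one
    # 1 - C(n-c, k) / C(n, k), computed as a stable running product
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 16 samples per problem, 3 correct -> Pass@8
print(pass_at_k(16, 3, 8))
```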