Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Abstract

Reinforcement learning (RL) has emerged as a promising approach to improving large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, which limits our understanding of how broadly RL applies to general reasoning. A key challenge is the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains (Math, Code, Science, Logic, Simulation, and Tabular), each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming the best baselines by 7.9% and 6.7%, respectively, on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, and training and evaluation code to facilitate general-purpose reasoning at https://github.com/LLM360/Reasoning360
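
For readers unfamiliar with the Pass@k metric mentioned in the abstract, the sketch below shows one common way to estimate it. The abstract does not say how Pass@k is computed for Guru, so this is an assumption: it uses the widely used unbiased estimator from code-generation evaluation, where n samples are drawn per problem and c of them are verified as correct. The function name `pass_at_k` is hypothetical and is not taken from the Reasoning360 repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for a problem
    c: number of those samples verified as correct
    k: sample budget being evaluated (1 <= k <= n)

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper may compute Pass@k differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # must contain at least one correct sample.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    for k in (1, 2, 4, 8):
        print(k, round(pass_at_k(n=16, c=3, k=k), 4))
```

A benchmark-level score would then be the mean of this per-problem estimate over all problems in a task.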
