Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Existing research has concentrated predominantly on isolated reasoning domains such as mathematical problem solving, coding tasks, or logical reasoning. Real-world reasoning scenarios, however, inherently demand the integrated application of multiple cognitive skills, and the interplay among these skills under reinforcement learning remains poorly understood. To bridge this gap, we present a systematic investigation of multi-domain reasoning within the RLVR framework, focusing on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. Our study comprises four key components: (1) Using the GRPO algorithm and the Qwen-2.5-7B model family, we evaluate the in-domain improvements and cross-domain generalization of models trained on single-domain datasets. (2) We examine the interactions, including mutual enhancements and conflicts, that emerge during combined cross-domain training. (3) To understand the influence of supervised fine-tuning (SFT) on RL, we analyze and compare the performance of base and instruct models under identical RL configurations. (4) We investigate critical RL training details, systematically exploring the impact of curriculum learning strategies, variations in reward design, and language-specific factors. Through extensive experiments, our results offer insights into the dynamics governing domain interactions and reveal key factors that influence both specialized and generalizable reasoning performance. These findings provide guidance for optimizing RL methodologies to foster comprehensive, multi-domain reasoning capabilities in LLMs.
