Interleaved Reasoning for Large Language Models via Reinforcement Learning

Abstract

Long chain-of-thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs). However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective rule-based reward to incentivize correct intermediate steps, which guides the policy model toward correct reasoning paths by leveraging intermediate signals generated during interleaved reasoning. Extensive experiments across five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++) demonstrate consistent improvements over traditional think-answer reasoning, without requiring external tools. Specifically, our approach reduces TTFT by over 80% on average and improves Pass@1 accuracy by up to 19.3%. Furthermore, our method, trained solely on question answering and logical reasoning datasets, exhibits strong generalization to complex reasoning datasets such as MATH, GPQA, and MMLU. Additionally, we conduct an in-depth analysis that reveals several valuable insights into conditional reward modeling.
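To make the idea of a rule-based reward for intermediate steps concrete, here is a minimal sketch. It is not the paper's implementation. The `<answer>` tag format, the function name `interleaved_reward`, the `step_bonus` weight, and the rule that intermediate credit is granted only when the final answer is correct are all assumptions made for illustration; the abstract does not specify any of them.

```python
import re

# Hypothetical tag format: the abstract does not say how answers are delimited.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text):
    """Lowercase and collapse whitespace so exact-match comparison is lenient."""
    return " ".join(text.lower().split())


def interleaved_reward(response, gold_intermediate, gold_final, step_bonus=0.5):
    """Return 1.0 for a correct final answer plus partial credit for correct
    intermediate answers; return 0.0 if the final answer is wrong or missing."""
    answers = [normalize(a) for a in ANSWER_RE.findall(response)]
    if not answers:
        return 0.0  # nothing parsable, so no reward
    *intermediate, final = answers  # the last <answer> is treated as final
    if final != normalize(gold_final):
        # Assumed condition: no intermediate credit without a correct final answer.
        return 0.0
    if not gold_intermediate:
        return 1.0
    gold = [normalize(g) for g in gold_intermediate]
    # Compare intermediate answers position by position with the gold hops.
    hits = sum(1 for pred, g in zip(intermediate, gold) if pred == g)
    return 1.0 + step_bonus * hits / len(gold)


if __name__ == "__main__":
    demo = (
        "<think>Find the capital first.</think><answer>Paris</answer>"
        "<think>Now find the country.</think><answer>France</answer>"
    )
    print(interleaved_reward(demo, ["Paris"], "France"))
```

Under these assumptions, a response earns the extra credit only when each hop's answer appears in order before the final answer. This is one simple way to turn intermediate outputs into a training signal.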

