Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning

Abstract

As test-time scaling becomes a pivotal research frontier in large language model (LLM) development, advanced post-training methods increasingly focus on extending the length of Chain-of-Thought (CoT) responses to push reasoning capabilities toward DeepSeek R1-like performance. However, recent studies reveal a persistent overthinking phenomenon in state-of-the-art reasoning models, which manifests as excessive redundancy or repetitive thinking patterns in long CoT responses. To address this issue, we propose ConciseR, a simple yet effective two-stage reinforcement learning framework for concise reasoning in LLMs. The first stage, which uses more training steps, aims to incentivize the model's reasoning capabilities via Group Relative Policy Optimization with clip-higher and dynamic sampling components (GRPO++). The second stage, which uses fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization (L-GRPO). Notably, ConciseR optimizes response length only once all rollouts of a sample are correct, following the "walk before you run" principle. Extensive experiments show that our ConciseR model, which generates more concise CoT reasoning responses, outperforms recent state-of-the-art reasoning models trained under the zero RL paradigm across the AIME 2024, MATH-500, AMC 2023, Minerva, and Olympiad benchmarks.
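
The "walk before you run" gating rule can be illustrated with a minimal sketch. The abstract does not give the reward formula, so everything here is an assumption made for illustration: the function names `gated_rewards` and `group_advantages`, the linear length bonus, and the parameters `max_len` and `length_weight` are all hypothetical. Only two ideas come from the text: rewards are compared within a group of rollouts for the same sample, and length is optimized only when every rollout in that group is correct.

```python
from statistics import mean, pstdev


def gated_rewards(correct, lengths, max_len, length_weight=0.5):
    """Return one scalar reward per rollout in a group (hypothetical form).

    correct: list of bools, one per rollout of the same sample.
    lengths: list of response lengths in tokens, same order.
    The length bonus is an assumed linear term, not the paper's formula.
    """
    base = [1.0 if c else 0.0 for c in correct]
    if not all(correct):
        # "Walk": at least one rollout is wrong, so reward accuracy only.
        return base
    # "Run": every rollout is correct, so shorter responses earn a bonus.
    return [
        b + length_weight * (1.0 - min(n, max_len) / max_len)
        for b, n in zip(base, lengths)
    ]


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    mixed = gated_rewards([True, False], [100, 50], max_len=1000)
    solved = gated_rewards([True, True], [100, 50], max_len=1000)
    print(mixed, group_advantages(mixed))
    print(solved, group_advantages(solved))
```

Under these assumptions, a group with any incorrect rollout is scored on correctness alone, while a fully correct group gives the shorter response the larger advantage.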
