InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

Abstract

Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from cost that grows quadratically with sequence length, from context length limits, and from degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and do not optimize when to summarize, what to preserve, or how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme, a supervised cold start followed by trajectory-level reinforcement learning, which lets the model learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
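The iterative reasoning loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `step` callable, its `(reasoning, summary, done)` return shape, the prompt layout, and the `attention_cost` proxy are all hypothetical names introduced here. The sketch shows the model deciding when an iteration ends (`done`), carrying forward only a summary, and why bounded segments avoid the quadratic cost of one long chain.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not the paper's API):
# step(prompt) returns (reasoning_text, summary_or_answer, done).
StepFn = Callable[[str], Tuple[str, str, bool]]


def iterative_reason(question: str, step: StepFn, max_iters: int = 8) -> Tuple[str, List[str]]:
    """Run summarize-and-resume reasoning until the model signals it is done."""
    summary = ""
    trace: List[str] = []
    for _ in range(max_iters):
        # Each iteration sees only the question plus the latest summary,
        # never the full reasoning history, so the context stays bounded.
        if summary:
            prompt = f"{question}\n\nSummary so far:\n{summary}"
        else:
            prompt = question
        reasoning, summary, done = step(prompt)
        trace.append(reasoning)
        if done:  # the model, not a fixed rule, chooses the boundary
            break
    return summary, trace


def attention_cost(segment_lengths: List[int]) -> int:
    """Crude proxy: pairwise attention cost is the sum of squared segment lengths.

    It ignores the question and summary tokens that each iteration re-reads.
    """
    return sum(n * n for n in segment_lengths)
```

Under this proxy, one 4,000-token chain costs `attention_cost([4000])`, which is 16,000,000, while four 1,000-token iterations cost `attention_cost([1000] * 4)`, which is 4,000,000. The reinforcement learning stage described above would then score the whole trajectory returned by a loop of this kind rather than any single iteration.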
