InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

Abstract

Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and do not optimize when to summarize, what to preserve, or how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme, a supervised cold start followed by trajectory-level reinforcement learning, which enables the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
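The iterative reasoning loop described in the abstract can be illustrated with a minimal sketch. This is not the InftyThink+ implementation: `iterative_reason`, `toy_step`, and `attention_cost` are hypothetical names introduced here, the "model" is a toy function that sums integers three at a time, and the reinforcement learning stage is not shown. The sketch only shows the control flow in which each iteration sees the question plus the latest summary, and the step itself decides what to carry forward and when to stop. The cost function is a crude proxy that assumes self-attention cost grows with the square of segment length and ignores prompt and summary tokens.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface (an assumption, not from the paper):
# (question, summary) -> (reasoning, new_summary, final_answer or None)
StepFn = Callable[[str, str], Tuple[str, str, Optional[str]]]


def iterative_reason(question: str, step: StepFn, max_iters: int = 8):
    """Reason in bounded segments; only the summary is carried between iterations."""
    summary = ""
    for i in range(1, max_iters + 1):
        _reasoning, summary, answer = step(question, summary)
        if answer is not None:  # the step chose to conclude
            return answer, i
    return None, max_iters  # iteration budget exhausted without an answer


def toy_step(question: str, summary: str):
    """Toy stand-in for a model call: adds the integers 1..N, three per iteration."""
    n = int(question)
    nxt, total = (1, 0) if not summary else map(int, summary.split(","))
    chunk = list(range(nxt, min(nxt + 3, n + 1)))
    total += sum(chunk)
    reasoning = " + ".join(map(str, chunk))
    nxt = chunk[-1] + 1
    if nxt > n:
        return reasoning, "", str(total)
    # The summary keeps only what is needed to resume: next integer and running total.
    return reasoning, f"{nxt},{total}", None


def attention_cost(segment_lengths: List[int]) -> int:
    """Sum of squared segment lengths, a rough proxy for quadratic attention cost."""
    return sum(n * n for n in segment_lengths)


if __name__ == "__main__":
    print(iterative_reason("10", toy_step))
    print(attention_cost([4000]), attention_cost([1000] * 4))
```

Under the stated proxy, one 4000-token segment costs 16,000,000 units while four 1000-token segments cost 4,000,000, which is the intuition behind the claim that periodic summarization avoids quadratic growth. Real savings depend on summary length and on how many iterations the model chooses to run.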
