InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning
February 6, 2026
Authors: Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Jian Shao, Yueting Zhuang, Yongliang Shen
cs.AI
Abstract
Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
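The bounded-context loop the abstract describes, where the model periodically compresses its intermediate thoughts and resumes from the summary rather than the full trace, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate`, `summarize`, and the `<continue>`/`<answer>` markers are hypothetical stand-ins for real model calls and the model-controlled iteration boundary.

```python
# Minimal sketch of an InftyThink-style iterative reasoning loop.
# Assumptions (not from the paper): the model signals a boundary with
# "<continue>" and a final result with "<answer>...</answer>"; the
# functions below are toy stand-ins for actual model inference.

def generate(prompt: str) -> str:
    # Toy stand-in: pretend the model reasons for two iterations,
    # then produces a final answer.
    if prompt.count("[summary]") < 2:
        return "partial reasoning ... <continue>"
    return "final reasoning ... <answer>42</answer>"

def summarize(thoughts: str) -> str:
    # Toy stand-in: a real system would prompt the model to compress
    # its intermediate thoughts; here we just truncate.
    return "[summary] " + thoughts[:40]

def iterative_reason(question: str, max_iters: int = 8) -> str:
    """Each iteration sees only the question plus a running summary,
    so per-step context stays bounded instead of growing with the
    full chain of thought."""
    summary = ""
    for _ in range(max_iters):
        out = generate(f"{question}\n{summary}")
        if "<answer>" in out:
            return out.split("<answer>")[1].split("</answer>")[0]
        # Model chose to continue: compress this round's thoughts
        # and carry only the summary forward.
        summary = summarize(summary + " " + out)
    return ""

print(iterative_reason("What is 6 * 7?"))  # -> 42
```

In InftyThink+, both decisions this sketch hard-codes, when to emit the boundary and what the summary preserves, are what the trajectory-level reinforcement learning stage optimizes.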