

SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression

September 29, 2025
Authors: Haoming Wen, Yushi Bai, Juanzi Li, Jie Tang
cs.AI

Abstract

We introduce SIRI, Scaling Iterative Reinforcement Learning with Interleaved Compression, a simple yet effective RL approach for Large Reasoning Models (LRMs) that enables more efficient and accurate reasoning. Existing studies have observed repetitive thinking patterns in LRMs, and attempts to reduce them often come at the cost of performance. In this paper, we show that this trade-off can be overcome through a training regime that iteratively alternates between compressing and expanding the reasoning budget, by dynamically adjusting the maximum rollout length during training. The compression phase cuts the rollout length, forcing the model to make precise and valuable decisions within a limited context, which effectively reduces redundant tokens and increases reasoning density. The expansion phase then relaxes the length limit, providing space for the model to explore and plan in long-horizon settings. Remarkably, we find that after each compression-expansion cycle, the model's performance improves even as its output length decreases, steadily pushing it closer to the Pareto frontier in the performance-efficiency trade-off. Training on DeepSeek-R1-Distill-Qwen-1.5B, SIRI-low improves performance on AIME24 by 43.2% while reducing token usage by 46.9% after three iterations, and SIRI-high achieves the highest accuracy compared to all other methods (Figure 1). Our findings shed light on the potential of periodically oscillating the LRM's output truncation length during training to dynamically balance exploration and efficiency in reasoning, converging towards an optimal "sweet spot" between the two. Our models are publicly available.
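To make the training regime concrete, below is a minimal sketch (not the authors' released code) of an interleaved rollout-length schedule: the maximum rollout length oscillates between a compression phase (a shorter budget that forces denser reasoning) and an expansion phase (a longer budget that leaves room for exploration and planning). The phase length, the token budgets, and the choice of which phase comes first are illustrative assumptions, not values from the paper.

```python
def rollout_length_schedule(step: int,
                            phase_steps: int = 200,     # assumed steps per phase
                            compress_len: int = 4096,   # assumed compressed budget (tokens)
                            expand_len: int = 16384) -> int:  # assumed expanded budget (tokens)
    """Return the maximum rollout length (in tokens) for a given RL training step.

    The schedule alternates between expansion and compression phases every
    `phase_steps` steps, mirroring the compress-expand cycles described above.
    """
    phase_index = (step // phase_steps) % 2
    # Even-numbered phases expand the budget, odd-numbered phases compress it
    # (whether training starts with expansion or compression is an assumption here).
    return expand_len if phase_index == 0 else compress_len


if __name__ == "__main__":
    # Show the budget at a few steps to illustrate the alternation.
    for step in (0, 199, 200, 399, 400):
        print(step, rollout_length_schedule(step))
```

In an actual RL loop, this budget would be passed to the rollout generator as the truncation length; trajectories exceeding it are cut off, so the compression phase directly penalizes redundant tokens while the expansion phase restores long-horizon exploration.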