SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression

Abstract

We introduce SIRI (Scaling Iterative Reinforcement Learning with Interleaved Compression), a simple yet effective RL approach for Large Reasoning Models (LRMs) that enables more efficient and accurate reasoning. Existing studies have observed repetitive thinking patterns in LRMs, and attempts to reduce them often come at the cost of performance. In this paper, we show that this trade-off can be overcome through a training regime that iteratively alternates between compressing and expanding the reasoning budget by dynamically adjusting the maximum rollout length during training. The compression phase shortens the rollout length, forcing the model to make precise and valuable decisions within a limited context, which effectively reduces redundant tokens and increases reasoning density. The expansion phase then relaxes the length limit, giving the model room to explore and plan in long-horizon settings. Remarkably, we find that after each compression-expansion cycle, the model's performance improves even as its output length decreases, steadily pushing it closer to the Pareto frontier of the performance-efficiency trade-off. When trained on DeepSeek-R1-Distill-Qwen-1.5B, SIRI-low improves performance on AIME24 by 43.2% while reducing token usage by 46.9% after three iterations, and SIRI-high achieves the highest accuracy among all compared methods (Figure 1). Our findings shed light on the potential of periodically oscillating the LRM's output truncation length during training to dynamically balance exploration and efficiency in reasoning, converging towards an optimal "sweet spot" between the two. Our models are publicly available.
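
To make the alternating budget concrete, the following is a minimal sketch of a length schedule that switches between a compression phase and an expansion phase. The abstract does not specify the shape of the schedule or any numeric values, so the step-function form, the function name `max_rollout_length`, and every default (`cycle_steps`, `short_len`, `long_len`) are hypothetical placeholders chosen only for illustration.

```python
def max_rollout_length(step, cycle_steps=100, short_len=8192, long_len=16384):
    """Return the maximum rollout length (in tokens) for a training step.

    Assumption: each cycle is split into two halves, a compression phase
    that uses the short limit followed by an expansion phase that uses the
    long limit. The defaults are illustrative, not values from the paper.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if cycle_steps < 2:
        raise ValueError("cycle_steps must be at least 2")
    if short_len >= long_len:
        raise ValueError("short_len must be smaller than long_len")

    position = step % cycle_steps               # where we are inside the cycle
    in_compression = position < cycle_steps // 2
    return short_len if in_compression else long_len


def truncate_rollout(tokens, limit):
    """Cut a generated token sequence to the current length limit."""
    return tokens[:limit]


# Sample the schedule every 50 steps across three cycles.
schedule = [max_rollout_length(s) for s in range(0, 300, 50)]
```

In an RL training loop, the value returned for the current step would be passed as the generation length cap before rollouts are sampled; a smooth (for example, cosine) oscillation could be substituted for the step function without changing the rest of the loop.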
