SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression

Abstract

We introduce SIRI, Scaling Iterative Reinforcement Learning with Interleaved Compression, a simple yet effective RL approach for Large Reasoning Models (LRMs) that enables more efficient and accurate reasoning. Existing studies have observed repetitive thinking patterns in LRMs, and attempts to reduce them often come at the cost of performance. In this paper, we show that this trade-off can be overcome through a training regime that iteratively alternates between compressing and expanding the reasoning budget by dynamically adjusting the maximum rollout length during training. The compression phase cuts the rollout length, forcing the model to make precise and valuable decisions within a limited context, which effectively reduces redundant tokens and increases reasoning density. The expansion phase then relaxes the length limit, giving the model space to explore and plan in long-horizon settings. Remarkably, we find that after each compression-expansion cycle, the model's performance improves even as its output length decreases, steadily pushing it closer to the Pareto frontier in the performance-efficiency trade-off. Training on DeepSeek-R1-Distill-Qwen-1.5B, SIRI-low improves performance on AIME24 by 43.2% while reducing token usage by 46.9% after three iterations, and SIRI-high achieves the highest accuracy among all compared methods (Figure 1). Our findings shed light on the potential of periodically oscillating the LRM's output truncation length during training to dynamically balance exploration and efficiency in reasoning, converging towards an optimal "sweet spot" between the two. Our models are publicly available.
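
The alternation between compression and expansion described in the abstract can be pictured as a schedule for the maximum rollout length. The following is a minimal sketch, not the authors' implementation: the abstract does not state the shape of the schedule, the phase durations, or the length limits, so the step-wise shape, the `LengthSchedule` class, its field names, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LengthSchedule:
    """Hypothetical step-wise schedule; every value here is an assumption."""
    compress_len: int = 8192    # max rollout tokens in a compression phase (assumed)
    expand_len: int = 16384     # max rollout tokens in an expansion phase (assumed)
    compress_steps: int = 100   # training steps per compression phase (assumed)
    expand_steps: int = 100     # training steps per expansion phase (assumed)

    def phase(self, step: int) -> str:
        # One cycle is a compression phase followed by an expansion phase.
        cycle = self.compress_steps + self.expand_steps
        return "compress" if step % cycle < self.compress_steps else "expand"

    def max_rollout_len(self, step: int) -> int:
        # The rollout limit depends only on which phase the step falls in.
        if self.phase(step) == "compress":
            return self.compress_len
        return self.expand_len


def truncate(tokens: Sequence[int], limit: int) -> list:
    # Cut a sampled rollout at the current limit, as an RL trainer would
    # before scoring the (possibly unfinished) response.
    return list(tokens[:limit])


schedule = LengthSchedule()
for step in (0, 99, 100, 199, 200):
    print(step, schedule.phase(step), schedule.max_rollout_len(step))
```

In a training loop, `schedule.max_rollout_len(step)` would be passed to the rollout sampler at each step. A smooth schedule (for example, one that ramps the limit gradually) would fit the same interface; the step function is used here only because it is the simplest form of periodic alternation.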
