SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Abstract

Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
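The void-turn filter described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how code blocks or final answers are detected, so the sketch assumes that a code block is a complete Markdown-style fence and that a final answer is marked with `\boxed{...}`. The names `Trajectory`, `is_void_turn`, `has_void_turn`, and `filter_void_trajectories` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumption: code is delimited by a Markdown-style fence of three backticks.
FENCE = "`" * 3
# Assumption: a complete code block has an opening fence, optional language tag, and closing fence.
CODE_BLOCK = re.compile(re.escape(FENCE) + r"\w*\n.*?" + re.escape(FENCE), re.DOTALL)
# Assumption: a final answer is marked with \boxed{...}.
FINAL_ANSWER = re.compile(r"\\boxed\{")


@dataclass
class Trajectory:
    """Hypothetical container for one multi-turn rollout."""
    turns: List[str]  # model-generated text of each turn (tool output excluded)
    reward: float


def is_void_turn(turn: str) -> bool:
    """A turn is void if it yields neither a complete code block nor a final answer."""
    return not (CODE_BLOCK.search(turn) or FINAL_ANSWER.search(turn))


def has_void_turn(trajectory: Trajectory) -> bool:
    """True if any turn in the trajectory is void."""
    return any(is_void_turn(turn) for turn in trajectory.turns)


def filter_void_trajectories(batch: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories without void turns; the rest are excluded from the policy update."""
    return [trajectory for trajectory in batch if not has_void_turn(trajectory)]
```

In a training loop, a filter of this kind would presumably run after rollout collection and before the policy loss is computed, so that trajectories with void turns contribute no gradient.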
