Understanding Tool-Integrated Reasoning

Abstract

We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools such as Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. We also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that guides policy behavior by directly modifying the advantage function, without compromising training stability or performance. We conduct comprehensive experiments on challenging mathematical benchmarks, using a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally intensive problems but extends to those requiring significant abstract insight. We further identify emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report improved tool-usage behavior with ASPO: earlier code invocation and many more interactive turns. Overall, our work provides the first principled explanation for TIR's success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.
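The abstract reports results on pass@k but does not say how the metric is estimated. As a minimal sketch, assuming the commonly used unbiased estimator (draw n samples per problem, count the c correct ones, and compute the probability that at least one of k samples drawn without replacement is correct), the computation could look like the following. The function name `pass_at_k` and the sample counts in the usage lines are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (assumed estimator, not confirmed by the paper).

    n: total number of sampled solutions
    c: number of those samples that are correct
    k: evaluation budget, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples for one problem, 3 of them correct.
print(f"{pass_at_k(10, 3, 1):.3f}")  # 0.300
print(f"{pass_at_k(10, 3, 5):.3f}")  # 0.917
```

Averaging this quantity over all problems in a benchmark gives the benchmark-level score. Under this reading, a larger k probes whether a correct solution exists anywhere in the model's sampled outputs, which is how the metric relates to the support-expansion claim above.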
