OTC: Optimal Tool Calls via Reinforcement Learning

Abstract

Tool-integrated reasoning (TIR) augments large language models (LLMs) with the ability to invoke external tools, such as search engines and code interpreters, to solve tasks beyond the capabilities of language-only reasoning. While reinforcement learning (RL) has shown promise in improving TIR by optimizing final answer correctness, existing approaches often overlook the efficiency and cost of tool usage. This can lead to suboptimal behavior, including excessive tool calls that increase computational and financial overhead, or insufficient tool use that compromises answer quality. In this work, we propose Optimal Tool Call-controlled Policy Optimization (OTC-PO), a simple yet effective RL-based framework that encourages models to produce accurate answers with minimal tool calls. Our method introduces a tool-integrated reward that jointly considers correctness and tool efficiency, promoting high tool productivity. We instantiate this framework within both Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), resulting in OTC-PPO and OTC-GRPO. Experiments with Qwen-2.5 and Qwen-Math across multiple QA benchmarks show that our approach reduces tool calls by up to 73.1% and improves tool productivity by up to 229.4%, while maintaining comparable answer accuracy. To the best of our knowledge, this is the first RL-based framework that explicitly optimizes tool-use efficiency in TIR.
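To make the idea of a reward that jointly considers correctness and tool efficiency more concrete, the following is a minimal sketch, not the formulation used in this work. The `Rollout` type, the `penalty` parameter, and the `1 / (1 + penalty * calls)` scaling are all hypothetical choices made for illustration. The sketch also assumes that tool productivity means correct answers per tool call, which the abstract does not define.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    correct: bool    # whether the final answer matched the reference
    tool_calls: int  # number of external tool invocations in the trajectory


def tool_integrated_reward(rollout: Rollout, penalty: float = 0.1) -> float:
    """Hypothetical reward: credit for a correct answer, discounted per tool call.

    The discount shape is an illustrative assumption, not the paper's formula.
    """
    if not rollout.correct:
        return 0.0
    return 1.0 / (1.0 + penalty * rollout.tool_calls)


def tool_productivity(rollouts: list[Rollout]) -> float:
    """Assumed definition: number of correct answers per tool call in a batch."""
    total_calls = sum(r.tool_calls for r in rollouts)
    num_correct = sum(r.correct for r in rollouts)
    # Treat a batch with zero tool calls as one call to avoid division by zero.
    return num_correct / max(total_calls, 1)
```

Under these assumptions, a correct answer reached with fewer tool calls earns a higher reward than a correct answer reached with more, and an incorrect answer earns nothing regardless of how many tools were called. This is the qualitative behavior the abstract describes.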
