Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning

Abstract

Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to strengthen their internal reasoning by integrating external tools. However, models that use TIR often behave suboptimally: they call tools too rarely or too often, and they overthink after a tool call returns. How to incentivize LLMs to perform TIR efficiently and accurately while keeping the reasoning process stable remains an open question. In this paper, we first examine how tool calls affect model reasoning from the perspective of information entropy. Our findings indicate that tool call results produce a distinct change in the information entropy of the subsequent reasoning, and that the overall entropy of the reasoning chain varies with the number of tool calls. Building on these insights, we propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. The framework consists of dataset construction and multi-stage fine-tuning. For dataset construction, we apply continuous self-evolved sampling with the fine-tuned model, combining vanilla sampling with entropy-guided sampling. In addition, we set strict criteria for selecting positive-negative pairs during sampling. Training proceeds in two stages: Supervised Fine-Tuning (SFT) followed by Self-Evolved Direct Preference Optimization (DPO). Experimental results on 10 datasets demonstrate the effectiveness of Tool-Light, which significantly improves the model's efficiency on TIR tasks.
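
As a rough illustration of the entropy perspective mentioned in the abstract, the sketch below computes the Shannon entropy of individual next-token distributions and averages it over a reasoning segment. It is not the authors' implementation: the function names `token_entropy` and `mean_chain_entropy` are hypothetical, and the probability values are made-up toy numbers chosen only to show that two segments can differ in entropy, not measurements from the paper.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    # Zero-probability entries contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_chain_entropy(step_probs: Sequence[Sequence[float]]) -> float:
    """Average per-token entropy over a segment of a reasoning chain."""
    if not step_probs:
        return 0.0
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)


# Toy, invented distributions (assumption): one segment of tokens generated
# before a tool call and one generated after the tool result is returned.
before_tool = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
after_tool = [[0.97, 0.01, 0.01, 0.01], [0.9, 0.05, 0.03, 0.02]]

print("mean entropy before tool call:", mean_chain_entropy(before_tool))
print("mean entropy after tool call: ", mean_chain_entropy(after_tool))
```

In practice the distributions would come from the model's output probabilities at each generated token; comparing segment averages like this is one plausible way to observe the kind of entropy change the abstract describes.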
