Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning
September 27, 2025
Authors: Yifei Chen, Guanting Dong, Zhicheng Dou
cs.AI
Abstract
Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to
improve their internal reasoning ability by integrating external tools.
However, models employing TIR often display suboptimal behaviors, such as
insufficient or excessive tool usage and overthinking after tool calls. How to
incentivize LLMs to perform TIR efficiently and accurately while stabilizing
the reasoning process remains an open question. In this
paper, we start by exploring the impact of tool calls on model reasoning from
the perspective of information entropy. Our findings indicate that tool call
results lead to a distinct change in the information entropy of subsequent
reasoning, with the overall entropy of the reasoning chain varying with the
number of tool calls. Building on these insights, we propose Tool-Light, a
framework designed to encourage LLMs to perform TIR efficiently and accurately.
Our framework includes dataset construction and multi-stage fine-tuning. For
dataset construction, we employ continuous self-evolved sampling using the
fine-tuned model, integrating both vanilla sampling and entropy-guided
sampling. In addition, we establish strict criteria for selecting positive-negative
pairs during sampling. The training process involves a two-stage approach,
comprising Supervised Fine-Tuning (SFT) and Self-Evolved Direct Preference
Optimization (DPO). Experimental results on 10 datasets demonstrate the
effectiveness of Tool-Light, significantly improving the model's efficiency in
executing TIR tasks.
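
The abstract analyzes tool calls from an information-entropy perspective. As a minimal, illustrative sketch (not the paper's implementation), the per-token entropy of a reasoning chain can be computed from the model's next-token logits and averaged over the segments before and after a tool-call result; the function names and segmentation scheme below are assumptions.

```python
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-position Shannon entropy (in nats) of the next-token distribution.

    logits: [seq_len, vocab_size] pre-softmax scores from the LLM.
    Returns: [seq_len] tensor of entropies.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def mean_segment_entropy(logits: torch.Tensor, start: int, end: int) -> float:
    """Mean token entropy over a reasoning segment [start, end),
    e.g. the tokens generated immediately after a tool-call result."""
    return token_entropy(logits[start:end]).mean().item()
```

Comparing `mean_segment_entropy` before and after each tool-call boundary gives one simple way to observe the entropy shift the abstract describes, and such segment-level statistics could also serve as the signal for entropy-guided sampling.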
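The second training stage is named Self-Evolved Direct Preference Optimization. The sketch below shows only the standard DPO objective that such a stage typically builds on, applied to the positive-negative trajectory pairs mined during sampling; the paper's self-evolved pair-selection criteria are not reproduced here, and all names are illustrative.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over (chosen, rejected) TIR trajectory pairs.

    Each argument is a [batch] tensor of summed log-probabilities of a
    full trajectory under the policy / frozen reference model.
    """
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```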